Replicating experiments in machine learning or natural language processing

Replicating experiments in machine learning or natural language processing and evaluating their robustness

Advisor: Anna Rafferty
Times: Fall 4a

Background

This 2020 paper examines the problem of automatically detecting whether a particular Twitter account is a real person or a bot. This problem is of increasing performance: automated accounts on social media can be used as part of campaigns to spread misinformation and influence public opinion and political actions. However, it's a challenging problem for computers to solve, as the goal of bot creators is often to make them seem human-like and many of them are crafted with the hope of evading automated content moderation. This paper takes the approach of examining features about the account, like how many followers it has, and using a machine learning algorithm called random forests to try to predict whether the account is a bot based on its features. To try to make their system perform better across a range of different Twitter accounts, they propose a new way of selecting what existing data the system should learn from. They then perform a number of different experiments to explore the accuracy of their system on different datasets and to examine what user features are most helpful for determining if an account is a bot.

This paper is one example of a common type of paper in machine learning and natural language processing: the paper identifies a problem, develops an algorithmic approach for solving that problem, and then tests it on one or more datasets. In that test of performance, the goal is typically to identify how well the proposed algorithm works versus alternative approaches and additionally to explore what kinds of examples one's algorithm can successfully classify versus what examples it makes errors on. Future research and applications often build on these experiments, relying on their results when deciding what algorithm is most appropriate for a new task or determining whether a new algorithm is better than existing work.

While this reliance on existing results is crucial in the scientific literature, it's less common to replicate other's work to confirm that the results are valid and evaluate the robustness of the results. The bot-detection paper gives one example of the kind of robustness we might care about: good performance on not just one dataset, but on a range of datasets. Testing the robustness of a machine learning or natural language process system can also include:

Testing how dependent the system's performance is on particular architecture choices or optimization methods, especially when these are not the main focus of the research.
Examining the system's performance on subsets of the data and quantifying the degree to which systematic differences in performance may indicative of bias. For example, in the Twitter data, we might examine whether users from different geographic regions are more or less likely to be incorrectly classified as bots.
Systematically removing parts of the model to assess which parts are crucial to good performance.

There's increasing concern in many scientific fields about the lack of attention to reproducibility and robustness of results. In psychology, alarms have been sounded that many purported psychological phenomena may be overblown, as some attempts to replicate them have been unsuccessful. In computer science, there have also been claims that "artificial intelligence faces a reproducibility crisis." Often, the details of experiments in published work are opaque, and sometimes important information for reproducing the work is not included. While the original authors may not have checked the robustness of their work, knowing how robust it is has important consequences: if the exact parameters used have major impacts on the results or the same approach on a different dataset produces very different results, then caution should be used when extrapolating from and building on the results. Without careful examination of whether a system has markedly different results for different subpopulations, published results may lead to deployment of systems that are in fact woefully inadequate for some populations, often perpetuating the marginalization of already marginalized groups.

The project

In this project, you'll learn about machine learning and/or natural language processing by replicating and/or evaluating the robustness of results from a paper published in the last 10-15 years in a top machine learning or natural language processing conference. In your replication, you'll try to duplicate the key experimental results from the original paper, starting either from scratch or from code published by the original authors. Additionally, you'll explore how robust these results are to changes, such as:

Does the same pattern of results hold if a different dataset is used?
How does removing parts of the system impact the overall results?
How much do the results vary across subgroups?

In your robustness investigation, you'll draw on the existing literature in deciding what to explore and in interpreting the outcome of your investigation. By replicating the work in the paper and examining its robustness, you'll both deeply engage with ideas in natural language processing and/or machine learning, and gain a better understanding of what it means for results to be both robust and generalizable, each of which are likely to be important for other applications and systems-building projects.

This project has a different structure than many computer science comps projects: you'll be working on comps during a single term (the fall), taking a total of six credits of comps that term. There will be roughly 18 students working on the project. At the beginning of the term, I'll offer several possible papers for students to work on, and we'll break down into smaller groups based on student interest. In these smaller teams of roughly 3-6 people, students will replicate one of the papers. We'll all meet together three times a week for a variety of activities, including learning about key aspects of machine learning experiments or using cloud computing resources that will be relevant across all the smaller teams, providing peer feedback to other teams, 1:1 meetings with me and individual teams, and project work time where I can help more informally. The goal of this one-term comps approach is to provide more frequent points of contact with the comps advisor, give extra structure to help students successfully navigate comps, and increase knowledge sharing across projects to in order to enhance the breadth of your learning in comps.

Deliverables

Your well-documented code for replicating the system in the paper, and scripts to run your experiments, organized in something like CodaLab that aims to increase reproducibility in computer science research.
A paper that describes what you did and your results. In your results, you'll describe how well you were able to replicate the original paper's findings and summarize your investigation of its robustness. Near the beginning of the term, your team and I will agree on the degree to which existing code will be used and correspondingly, the depth of robustness investigation that's expected.
(If time permits) Use Google scholar to examine what work has cited this paper, and explore whether the results have tended to be supported by later work. Your paper would summarize these findings, and also speak to whether the results of your experiments suggest any particular caution in interpreting or building on the results in the original paper.

Recommended experience

In this project, you'll be working on something related to machine learning and/or natural language processing. You don't need to have previous experience with this kind of work, but you should know it tends to be pretty algorithmic. Previous experience working with large datasets may be helpful but not necessary. Some courses that may be useful but are not required are Algorithms, Advanced Algorithms, Artificial Intelligence, Machine Learning, Data Science, Probability or Linear Algebra, and/or any courses in linguistics or statistics.

Example possible papers

There are lots of possible papers that you might end up replicating in this project. Here's a couple of examples of the types of papers that you might work with. Note that I'll be providing a list of papers to choose among at the beginning of the term; this list may or may not include these exact papers but will focus on similar topics:

Wang, Z., Lan, A. S., Nie, W., Waters, A. E., Grimaldi, P. J., & Baraniuk, R. G. (2018, June). QG-net: a data-driven question generation model for educational content. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale (pp. 1-10).
Lots of research in the learning sciences suggests that you'll remember more if you answer questions about what you read than if you re-read it. But where do those questions come from? This paper uses a neural network model to automatically generate questions for freely available textbooks. The experiments both compare to other models for automatic question generation and do some robustness testing by examining the quality of questions based on the topic of the textbook.
Mahmood, A., Shafiq, Z., & Srinivasan, P. (2020). A Girl Has A Name: Detecting Authorship Obfuscation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2235-2245).
Authorship obfuscation methods aim to make it difficult to automatically identify who wrote a text by making stylistic changes to that text. This paper examines how stealthy authorship obfuscation methods are: beyond making it hard to identify the author of the text, do they also make it hard to automatically detect whether the text is a real text or one that's been modified? The paper trains a model to differentiate between texts that have been modified and those that have not been modified, demonstrating that while the obfuscation methods may be effective at preventing authorship attribution, they also leave clear evidence of the obfuscation. Beyond showing that the obfuscation can be detected, the paper explores how robust performance is to the ways that the texts are pre-processed.
Yang, J., She, D., Lai, Y. K., & Yang, M. H. (2018, April). Retrieving and classifying affective images via deep metric learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Lots of algorithms for classifying images focus on identifying objects in the image, but what if rather than directly searching for an image about cats, I wanted to search for images that would make me happy? This paper presents one approach to this problem.
Chambers, N., & Jurafsky, D. (2011, June). Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 976-986). Association for Computational Linguistics.
Ever wondered how to automatically extract information about events from text, such as finding out who was involved, what their roles were, and where the event took place? This is often done by having specific templates specifying what information is needed for different types of events. For instance, an election event might involve voters, a government, and a candidate; a system would know something about what type of entity (e.g., a person) should fill each of these roles, and then try to find them in the text. This paper proposes an alternative approach where the templates are not given, and instead tries to identify likely possibly templates from text.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1631-1642).
The emotional content of a sentence often varies from beginning to end. For instance, take this excerpt from a review of The Martian: "The Martian won't please those expecting a dark, terrorizing thrill ride where the heroes are in constant peril, but it'll make the rest of us laugh and cheer, which is something sci-fi blockbusters don't do enough these days." There's negative emotion at points like "a dark, terrorizing thrill ride" but then also more positive points like "laugh and cheer." Figuring out how the emotional polarity, or sentiment, of words can be combined together to extract sentiments over longer spans of text, like phrases or sentences, is a challenging task. This paper proposes a neural network model and explores the use of this model, which takes into account word order, with previous methods that ignore the order of words.
Lindsey, R. V., Khajah, M., & Mozer, M. C. (2014). Automatic discovery of cognitive skills to improve the prediction of student learning. In Advances in Neural Information Processing Systems (pp. 1386-1394).
Websites like Khan Academy offer opportunities for students to practice mathematical problem solving, and similar systems are often used in middle and high school classrooms. Many of these systems try to predict student learning in order to provide students with problems that are neither too easy nor too hard. Such systems typically rely on knowing what skills are being targeted by each problem, but identifying the right granularity for specifying skills is difficult. This paper presents an approach to combining labels about skills from experts with labels automatically learned from student problem solving data.

Final Results 1

Final Results 2

Final Results 3

Final Results 4