2023–24 Projects:
Advisor: Anna Rafferty
Times: T/Th 8:25-10:25AM [You'll typically meet with me one day a week, and after roughly week 3 those meetings will be about 40 minutes.]
This 2015 paper aims to address the problem of automatically detecting that a tweet is sarcastic. This is something humans are good at: if you saw a tweet that read "Being stranded in traffic is the best way to start my week," you wouldn't need the hashtag #sarcastic to know that the person was not in fact excited about traffic. However, it's a problem that can be tough for computers, as it seems to require deep understanding of the meanings of words and sentences and of the context in which the statement is made. The paper's approach involves taking linguistic theories about sarcasm, turning them into computational features, and then using machine learning to identify the importance of the different features. To test the approach, the authors conducted experiments on three datasets, which showed a significant improvement in accuracy over alternative approaches.
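To make the feature-based approach more concrete, here's a minimal sketch of the general recipe: hand-crafted features are extracted from each tweet, and a linear classifier learns weights that indicate each feature's importance. The particular features, word lists, and tiny dataset below are illustrative assumptions, not the paper's actual feature set or data.

```python
# Sketch of a feature-based sarcasm classifier (illustrative, not the paper's).
from sklearn.linear_model import LogisticRegression

POSITIVE = {"best", "love", "great", "excited"}       # toy sentiment lexicons
NEGATIVE = {"stranded", "traffic", "delayed", "monday"}

def extract_features(tweet):
    words = tweet.lower().split()
    return [
        sum(w in POSITIVE for w in words),            # positive-word count
        sum(w in NEGATIVE for w in words),            # negative-word count
        int(any(w in POSITIVE for w in words) and
            any(w in NEGATIVE for w in words)),       # positive/negative contrast
        tweet.count("!"),                             # exclamation marks
    ]

# Tiny labeled dataset (1 = sarcastic); a real experiment would use thousands.
tweets = [
    ("Being stranded in traffic is the best way to start my week", 1),
    ("I love this great weather", 0),
]
X = [extract_features(text) for text, _ in tweets]
y = [label for _, label in tweets]
model = LogisticRegression().fit(X, y)
print(model.coef_)  # learned weights = estimated importance of each feature
```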
This paper is one example of a common type of paper in machine learning and natural language processing: the paper identifies a problem, develops an algorithmic approach for solving that problem, and then tests it on one or more datasets. In that test of performance, the goal is typically to identify how well the proposed algorithm works versus alternative approaches, and additionally to explore what kinds of examples the algorithm classifies successfully versus what examples it makes errors on. Future research and applications often build on these experiments, relying on their results when deciding what algorithm is most appropriate for a new task or determining whether a new algorithm is better than existing work. For instance, based on the paper above, one might conclude that to test whether one has a better sarcasm detector, one need only compare against this paper's algorithm, since the older approaches performed less well in its experiments. Yet it's rare that people directly try to replicate others' work to confirm that the results are valid and to evaluate whether the trends in the results hold in other datasets. In psychology, there has been concern in recent years that many purported psychological phenomena may be overblown, as some attempts to replicate them have been unsuccessful.
While computer science experiments are not the same as psychology experiments, there is still reason to be concerned about the lack of work focused on replicating computer science experiments. Often, the details of experiments in published work are opaque, and sometimes important information for reproducing the work is not included. Replicating previous work offers the opportunity to better understand that work and to investigate the robustness of the algorithm to changes in parameters or dataset. If the exact parameters used have major impacts on the results, or if the same approach on a different dataset produces very different results, that suggests caution should be used in generalizing the results and nuance should be added to the original conclusions.
In this project, you'll learn about machine learning and/or natural language processing by replicating a paper from the last 10-15 years that was published at one of the following conferences:
This project will start off with a group of roughly 12 students. At the beginning of the term, we'll discuss what makes a good paper for this kind of project, and I'll give you a list of suggested papers from these conferences. You'll each choose a paper to read and present over the first two or so weeks of the term. Then, based on your interests, you'll be divided into smaller teams of about four people, with each team replicating one of the papers presented in those first two weeks.
In your replication, you'll try to duplicate the key experimental results from the original paper. Additionally, you'll explore how robust these results are to changes: Does the same pattern of results hold if a slightly different dataset is used? How much do the results vary if slightly different parameter values are used or other minor changes are made? By replicating the work in the paper and investigating these questions, you'll deeply engage with and learn about work in natural language processing and/or machine learning, and you'll gain a better understanding of what it means for results to be robust and generalizable, each of which is likely to be important for other applications and systems-building projects.
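One simple way to picture this kind of robustness check is to rerun the same pipeline under small perturbations and see how much the headline number moves. The sketch below is illustrative, not a prescribed procedure: it uses a stand-in dataset and model, and varies the train/test split and one regularization parameter.

```python
# Sketch of a robustness check: how stable is accuracy under small changes?
import numpy as np
from sklearn.datasets import load_digits          # stand-in for a paper's dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
scores = []
for seed in range(5):                             # vary the data split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    for C in [0.01, 0.1, 1.0, 10.0]:              # vary a model parameter
        model = LogisticRegression(C=C, max_iter=5000).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))

print(f"accuracy: mean={np.mean(scores):.3f}, spread={np.ptp(scores):.3f}")
# A large spread suggests the result is sensitive to choices the paper may not
# have emphasized; a small spread supports the robustness of the finding.
```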
In this project, you'll be working on something related to machine learning and/or natural language processing. You don't need previous experience with this kind of work, but you should know it tends to be pretty algorithmic. Previous experience working with large datasets may be helpful but is not necessary. Some courses that may be useful but are not required are Algorithms, Advanced Algorithms, Artificial Intelligence, Data Mining, Computational Models of Cognition, Data Science, Probability or Linear Algebra, and/or any courses in linguistics.
There are lots of possible papers that you might end up replicating in this project. Here are a couple of examples of the types of papers that you might work with:
Lots of algorithms for classifying images focus on identifying objects in the image, but what if, rather than searching directly for images of cats, I wanted to search for images that would make me happy? This paper presents one approach to this problem.
Ever wondered how to automatically extract information about events from text, such as finding out who was involved, what their roles were, and where the event took place? This is often done by having specific templates specifying what information is needed for different types of events. For instance, an election event might involve voters, a government, and a candidate; a system would know something about what type of entity (e.g., a person) should fill each of these roles, and then try to find them in the text. This paper proposes an alternative approach in which the templates are not given in advance; instead, the system tries to identify likely templates from the text itself.
The emotional content of a sentence often varies from beginning to end. For instance, take this excerpt from a review of The Martian: "The Martian won't please those expecting a dark, terrorizing thrill ride where the heroes are in constant peril, but it'll make the rest of us laugh and cheer, which is something sci-fi blockbusters don't do enough these days." There's negative emotion at points like "a dark, terrorizing thrill ride" but then also more positive points like "laugh and cheer." Figuring out how the emotional polarity, or sentiment, of words can be combined to extract sentiment over longer spans of text, like phrases or sentences, is a challenging task. This paper proposes a neural network model that takes word order into account and compares it with previous methods that ignore the order of words (a toy illustration of why word order matters appears after these examples).
Websites like Khan Academy offer opportunities for students to practice mathematical problem solving, and similar systems are often used in middle and high school classrooms. Many of these systems try to predict student learning in order to provide students with problems that are neither too easy nor too hard. Such systems typically rely on knowing what skills are being targeted by each problem, but identifying the right granularity for specifying skills is difficult. This paper presents an approach to combining labels about skills from experts with labels automatically learned from student problem solving data.
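As promised in the sentiment example above, here is a toy illustration of the word-order point. This is an assumption-laden sketch and has nothing to do with the paper's actual neural model; it just shows that a bag-of-words representation assigns identical features, and hence would produce identical predictions, for two phrases whose meanings differ only in word order.

```python
# Why word order matters for sentiment: a bag-of-words model can't tell
# these two phrases apart, even though their sentiments are opposite.
from collections import Counter

a = "the movie was not good , it was bad"
b = "the movie was not bad , it was good"
print(Counter(a.split()) == Counter(b.split()))  # True: identical bags of words
# Any classifier built only on these counts must score both phrases the same;
# an order-sensitive model (like a neural network over word sequences) can
# distinguish them.
```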