2023–24 Projects:
Debugging code is hard. Figuring out how to fix compilation errors and runtime exceptions can be a significant challenge. Error messages can be difficult to understand. Even if the programmer understands the error, it may be far from obvious what the correct fix is.
A variety of research projects have used historical archives of code (working and broken) to study both bugs and fixes, with the goal of helping programmers identify their errors. Notably, HelpMeOut is a prototype system built to help programmers fix broken code by suggesting possible fixes. Those suggestions come from a database of code examples collected by the IDE: the HelpMeOut software searches for problems similar to the one being experienced and learns from the fixes that were made.
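The idea of looking up similar past problems can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the FixSuggester class, its FixExample entries, and the use of Jaccard similarity over error-message tokens are all hypothetical stand-ins chosen for brevity, not a description of how HelpMeOut actually matches errors.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: pick the stored example whose error message
// shares the most tokens with the error the programmer just hit.
class FixSuggester {
    // One stored example: the error seen, the broken code, and the code that fixed it.
    static final class FixExample {
        final String errorMessage;
        final String brokenCode;
        final String fixedCode;

        FixExample(String errorMessage, String brokenCode, String fixedCode) {
            this.errorMessage = errorMessage;
            this.brokenCode = brokenCode;
            this.fixedCode = fixedCode;
        }
    }

    // Split text into a set of lowercase word-like tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove(""); // split can yield an empty leading token
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double similarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        return (double) intersection.size() / union.size();
    }

    // Return the most similar stored example, or null if the database is empty.
    static FixExample bestMatch(String errorMessage, List<FixExample> database) {
        FixExample best = null;
        double bestScore = -1.0;
        for (FixExample example : database) {
            double score = similarity(errorMessage, example.errorMessage);
            if (score > bestScore) {
                bestScore = score;
                best = example;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny made-up database, for illustration only.
        List<FixExample> database = List.of(
            new FixExample("java.lang.NullPointerException",
                "s.length();", "if (s != null) { s.length(); }"),
            new FixExample("java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5",
                "a[i] = 0;", "if (i < a.length) { a[i] = 0; }"));
        FixExample match = bestMatch(
            "java.lang.ArrayIndexOutOfBoundsException: Index 100 out of bounds for length 100", database);
        System.out.println("Suggested fix: " + match.fixedCode);
    }
}
```

A real system would likely compare the broken code as well as the error message, and would need a far larger database than two hand-written entries.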
The HelpMeOut paper gives the following Java code as an example; it results in an ArrayIndexOutOfBoundsException when myArray has fewer than 200 elements:
for (int i=0; i < 200; i++) { myArray[i] = 0; }
Based on an analysis of similar bugs from other projects, the system recommends the following fix:
for (int i=0; i < 200; i++) { if (i < myArray.length) { myArray[i] = 0; } }
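To see both versions in action, the two loops can be wrapped in a small runnable program. This is a minimal sketch: the class name BoundsDemo, the method names, and the array size of 100 are assumptions added for illustration and do not come from the HelpMeOut paper.

```java
// Hypothetical demo of the original loop and the recommended fix.
class BoundsDemo {
    // Original loop: assumes the array has at least 200 elements.
    static void fillUnchecked(int[] myArray) {
        for (int i = 0; i < 200; i++) {
            myArray[i] = 0;
        }
    }

    // Loop with the recommended guard: indices past the end are skipped.
    static void fillGuarded(int[] myArray) {
        for (int i = 0; i < 200; i++) {
            if (i < myArray.length) {
                myArray[i] = 0;
            }
        }
    }

    public static void main(String[] args) {
        int[] myArray = new int[100]; // assumed size, smaller than the loop bound
        try {
            fillUnchecked(myArray);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Unchecked loop threw " + e.getClass().getSimpleName());
        }
        fillGuarded(myArray);
        System.out.println("Guarded loop finished without an exception");
    }
}
```

Note that the guard removes the exception but keeps the hard-coded bound of 200; a programmer might prefer to loop up to myArray.length directly, which is a reminder that a recommended fix is a suggestion rather than necessarily the best repair.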
The HelpMeOut project looks wonderfully promising. There are a few major challenges in using it:
For your project, you will implement a system that helps programmers fix bugs by making recommendations drawn from previous code examples. A major challenge, and a difference from the above-mentioned project, will be in how you find or generate a database of pre-existing code. Rather than collecting it locally (which would yield only a small amount of data), you will study and implement techniques for obtaining this data from already existing repositories. At the time of writing, I'm aware of two specific strategies that you might consider using, though others undoubtedly exist:
You'll get started by reading, analyzing, and presenting to each other the preexisting literature on code repository analysis and automated programming assistance. If you choose to go in the direction of using Blackbox data, you will need to obtain permission from the Blackbox project to access the data, and be willing to sign an agreement promising to keep the data itself confidential. Students will also need to submit an IRB proposal to Carleton for permission to use the Blackbox data.
After choosing a programming language to focus on and a particular approach to use, you'll choose how to implement it. Perhaps this starts off as a command-line tool that supplements a command-line compiler; perhaps you will implement an addition to a GUI programming environment. Many popular IDEs support some form of plugins, so you may be able to implement your tool as a plugin to one or more of those environments.
A major challenge in using the Blackbox data, as noted above, is that such data is not licensed for public use. You can use it for research purposes, but your tool cannot display Blackbox code snippets as recommended fixes. However, even in work such as HelpMeOut, the software does not directly show code from the repository. Variables from the repository are renamed, for example, to match the variable names that the programmer uses. If one limits the recommendations to use only keywords from the language and identifier names created by the programmer looking for help, the only information being shown from the Blackbox repository is the structure of the fix. Does this adequately preserve privacy as required by the Blackbox project? Honestly, I'm not sure. It would certainly be fine as a prototype tool intended for local demonstration purposes. If the tool worked well and we chose to make it more widely available, it would be a great opportunity for conversation with the BlueJ team.
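The renaming idea above can be sketched as follows. This is a rough illustration under loud assumptions: the FixRenamer class, its positional matching of identifiers, and its small RESERVED set are all hypothetical simplifications, not the method used by HelpMeOut or anything endorsed by the Blackbox project. The sketch refuses to produce a suggestion if any repository identifier would remain in the output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: show only the structure of a repository fix,
// rewritten with the identifier names from the programmer's own code.
class FixRenamer {
    // Deliberately incomplete: a few Java keywords plus the built-in name "length".
    static final Set<String> RESERVED =
        Set.of("for", "if", "else", "while", "int", "return", "new", "length");
    static final Pattern IDENT = Pattern.compile("\\b[A-Za-z_]\\w*\\b");

    // Distinct non-reserved identifiers, in order of first appearance.
    static List<String> identifiers(String code) {
        List<String> names = new ArrayList<>();
        Matcher m = IDENT.matcher(code);
        while (m.find()) {
            String name = m.group();
            if (!RESERVED.contains(name) && !names.contains(name)) {
                names.add(name);
            }
        }
        return names;
    }

    // Rewrite repoFixed using the programmer's names, or return null if no safe mapping exists.
    static String suggest(String repoBroken, String repoFixed, String userBroken) {
        List<String> repoNames = identifiers(repoBroken);
        List<String> userNames = identifiers(userBroken);
        if (repoNames.size() != userNames.size()) {
            return null; // the two snippets do not line up; give up
        }
        Map<String, String> mapping = new HashMap<>();
        for (int i = 0; i < repoNames.size(); i++) {
            mapping.put(repoNames.get(i), userNames.get(i));
        }
        StringBuilder out = new StringBuilder();
        Matcher m = IDENT.matcher(repoFixed);
        int last = 0;
        while (m.find()) {
            String name = m.group();
            String replacement;
            if (RESERVED.contains(name)) {
                replacement = name;
            } else if (mapping.containsKey(name)) {
                replacement = mapping.get(name);
            } else {
                return null; // would leak an identifier from the repository
            }
            out.append(repoFixed, last, m.start()).append(replacement);
            last = m.end();
        }
        out.append(repoFixed.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up repository entry, for illustration only.
        String repoBroken = "for (int k=0; k < 200; k++) { data[k] = 0; }";
        String repoFixed = "for (int k=0; k < 200; k++) { if (k < data.length) { data[k] = 0; } }";
        String userBroken = "for (int i=0; i < 200; i++) { myArray[i] = 0; }";
        System.out.println(suggest(repoBroken, repoFixed, userBroken));
    }
}
```

Even in this form, literals and the overall shape of the fix still come from the repository, so whether such output satisfies the confidentiality agreement remains the open question raised above.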
If this tool can be made to work well with an existing repository, I believe there is a reasonable chance that it will be of interest to the CS education research community. If there is time and if all goes well, we may write a research paper and submit it to a CS education research conference.
If we use the Blackbox data, students will need to write SQL database queries to access it; a working knowledge of SQL would be helpful.