Beginning programmers make a variety of mistakes when learning to program. A number of CS education research projects have worked to better understand what these mistakes are, how frequently they happen, and how they affect the final program that students are trying to write. If we truly understood the patterns of behavior that students exhibit while coding, and which ones lead to success or difficulty, we could develop a whole new set of ideas for teaching students to code.
A number of recent research projects have sought to classify and quantify patterns in students' programming behavior, based on data collected during a programming session. Here are two examples; there are others out there:
Identifying the different patterns that students follow in compiling and running code, with and without errors, is interesting in itself, but the larger goal is to understand how this information relates to eventual success in producing a correct program. Are programmers who compile frequently more likely to succeed? Are students who hit the same error repeatedly more or less likely to succeed than students who jump between different errors without having fixed any one of them?
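One of the questions above, whether a student keeps hitting the same error, can be made concrete with a small metric. The following is only a minimal sketch under stated assumptions: the function name `repeated_error_rate` and the representation of a session as a list of error messages (with `None` for a successful compile) are hypothetical choices for illustration, not taken from any existing study or from Blackbox.

```python
from typing import Optional, Sequence


def repeated_error_rate(events: Sequence[Optional[str]]) -> float:
    """Fraction of back-to-back failed compiles that show the same error.

    `events` is a hypothetical representation of one programming session:
    each entry is the first error message of a compile, or None if the
    compile succeeded.
    """
    pairs = 0    # consecutive compiles that both failed
    repeats = 0  # ...and that failed with the same error message
    for prev, curr in zip(events, events[1:]):
        if prev is not None and curr is not None:
            pairs += 1
            if prev == curr:
                repeats += 1
    # With no consecutive failures there is nothing to measure.
    return repeats / pairs if pairs else 0.0


# Made-up session for illustration only.
session = [
    "missing semicolon",
    "missing semicolon",
    "unknown symbol",
    None,
    "unknown symbol",
    None,
]
rate = repeated_error_rate(session)
```

A metric like this could then be compared across students who did and did not end up with a working program, though how best to define "the same error" is itself a research question.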
The biggest challenge in answering these questions is collecting data on students' compilation and execution attempts, and determining whether each student succeeded. How do we know if a student has succeeded? Multiple approaches have been used:
A truly exciting and relatively recent development is the Blackbox Data Collection Project, which is part of the BlueJ educational IDE for Java programming. BlueJ is used by millions of people around the world who are learning to program in Java. The Blackbox project collects data from within BlueJ (with users' permission). Specifically, Blackbox collects anonymized source code that the user is writing, as well as telemetry on which buttons have been pressed. This dataset is incredibly useful for understanding how students behave within BlueJ, and how that behavior interacts with the errors that they make. It is also much richer than data from a single institution. The goal for this project is to use Blackbox data to help answer the questions posed in the Setup above.
Here are some details as to how the project would proceed.
Literature search and administrative setup. You'll get started by reading, analyzing, and presenting to each other the preexisting literature on student code development analysis in general, and on Blackbox data in particular. Students will need to obtain permission from the Blackbox project to access the data, and be willing to sign an agreement promising to keep the data itself confidential. Students will also need to submit an Institutional Review Board (IRB) proposal to Carleton for permission to work on the project.
Replication of existing research. Blackbox data has already been used to learn much about programming behaviors and success patterns with regard to compilation. You'll replicate a preexisting analysis (which may or may not have been done on Blackbox data) to learn how to do it and to see if you can get consistent results.
Extension of existing research. Once the above is complete, you'll extend the analysis beyond compilation to measure which characteristics of programming patterns lead to code that runs successfully. Defining successful execution of code is challenging because it depends entirely on what each individual programmer is trying to do, which is generally unknown. However, a small but significant fraction of BlueJ work is undertaken by students who are using a textbook written by the BlueJ team. If you can identify which programs appear to be attempting those problems (and we can undoubtedly make an approximate guess in a variety of ways), you can use unit testing to determine whether the code does, in fact, run correctly.
Students will need to write SQL database queries to access the Blackbox data; a working knowledge of SQL would be helpful. For the data analysis in the project, some coursework in data analysis techniques would be useful. That includes, to varying degrees, any of the CS 32x courses offered by the CS department, or any of a number of statistics courses offered by the math department.
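To give a sense of the kind of SQL involved, here is a minimal sketch that counts compiles and failed compiles per session. It is illustrative only: the table `compile_events` and its columns `session_id` and `success` are hypothetical and are not the actual Blackbox schema, and the example runs against an in-memory SQLite database with made-up rows rather than the real data.

```python
import sqlite3


def compile_stats(conn: sqlite3.Connection) -> list:
    """Return (session_id, compiles, failures) for each session.

    Assumes a hypothetical table compile_events(session_id, success),
    where success is 1 for a clean compile and 0 for a failed one.
    """
    query = """
        SELECT session_id,
               COUNT(*) AS compiles,
               SUM(CASE WHEN success = 0 THEN 1 ELSE 0 END) AS failures
        FROM compile_events
        GROUP BY session_id
        ORDER BY session_id
    """
    return conn.execute(query).fetchall()


# Build a tiny in-memory database with made-up rows for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compile_events (session_id INTEGER, success INTEGER)")
conn.executemany(
    "INSERT INTO compile_events VALUES (?, ?)",
    [(1, 0), (1, 0), (1, 1), (2, 1)],
)
rows = compile_stats(conn)
```

Aggregations of this shape (grouping events by session or by user, then counting) are likely to be a common starting point before any statistical analysis.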
I expect that different students on the project will have different skills, though a student who has none of the above skills (no experience with databases or data analysis) may find this project challenging.