Advisor: Layla Oesper
The meteoric rise of big data in all its forms has the potential to revolutionize how many important decisions are made. One area of public life where such data could greatly affect decision-making is the criminal justice system. However, care must be taken to ensure that decisions based on data do not perpetuate human biases. For example, consider the following two real-world scenarios.
A 2016 bombshell article by ProPublica analyzed how an algorithm called COMPAS was being used in the criminal justice system. In short, this algorithm was used to determine the likelihood that a defendant would become a repeat offender, and this score was being used by judges when determining sentences. The ProPublica article described how “the formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.”
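To make the quoted claim concrete: "falsely flagging" a defendant corresponds to a false positive, meaning the person was labeled high risk but did not reoffend. The sketch below shows one way to compare false positive rates across groups. The function name, the record layout, and the toy records are all hypothetical and invented for illustration; they are not drawn from the COMPAS data or from ProPublica's analysis.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Return the false positive rate for each group.

    records: iterable of (group, flagged_high_risk, reoffended) tuples.
    This layout is an assumption for illustration, not the COMPAS schema.
    """
    false_positives = defaultdict(int)  # flagged high risk, did not reoffend
    negatives = defaultdict(int)        # everyone who did not reoffend
    for group, flagged_high_risk, reoffended in records:
        if not reoffended:
            negatives[group] += 1
            if flagged_high_risk:
                false_positives[group] += 1
    # Only groups with at least one non-reoffender appear in the result.
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Made-up toy records, NOT real data.
toy_records = [
    ("A", True, False), ("A", True, False), ("A", False, False),
    ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False),
    ("B", False, False), ("B", True, True),
]

rates = false_positive_rates(toy_records)
print(rates)
```

In this toy example, group A's false positive rate is twice group B's, even though both groups contain the same number of people who reoffended. A disparity of this kind is what the quote describes.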
On the other hand, data can also help to reveal biases within the criminal justice system. Analysis of New York City’s stop-and-frisk practices revealed that most stops led to no further action. Furthermore, a large majority of the individuals stopped were either black or Hispanic. Analyzing this large dataset to identify factors that are correlated with further police action, but that do not perpetuate biases, could be very useful.
In this project we are going to analyze the COMPAS dataset and the stop-and-frisk dataset. In particular, we are going to focus on how these datasets can be used for “bias-free” classification. Classification takes a set of data points (e.g. defendants) and assigns a label (or a class) to each data point (e.g. low risk, medium risk, high risk). Specifically, you will:
You should be interested in how big data analysis can be incorporated into the criminal justice system in a “fair manner.” Experience with machine learning or classification algorithms (as seen in CS32X courses) may be helpful, but is not required.
Below are a few papers about existing work analyzing these datasets. They are only intended to give you a starting point for your literature search; they are neither the only nor necessarily the best sources for ideas. You will be finding and reading many additional papers and articles!
Monday/Wednesday 3:10pm - 4:10pm (6a) for Fall and Winter