CS 324: k-Nearest Neighbor Classifier

One of the main applications of nearest neighbors is the k-NN (k-nearest neighbor) classifier, which attempts to classify an item based on the classifications of the neighbors around it.

The dataset that you'll use for this assignment is some very old anonymized data regarding Carleton's writing portfolio. You can find it on the department network in /Accounts/courses/cs324/writingportfolio.xlsx. The only reason I can distribute it at all is that it is so old and anonymized; nonetheless, please do not distribute it any further. It is in Excel format, which is a common way data is distributed in the "real world," so you'll need to figure out how to get the data out and into your programming environment of choice. Once you've done so, set aside "Project ID" (you can ignore it for most of the exercise, but you'll need it later; see below), and isolate the "needs work" column from the rest of the data. This last column is the actual classification you are trying to predict: a value of 1 for "needs work" means the student was deemed not to have a successful writing portfolio.
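One way to do this column separation, sketched here under the assumption that you've already loaded the spreadsheet into a list of dicts (e.g., by exporting the .xlsx to CSV and using csv.DictReader, or via pandas) and that the column names match the spreadsheet ("Project ID", "needs work"); the function name split_columns is just illustrative:

```python
def split_columns(rows):
    """Separate the bookkeeping columns from the feature columns.

    `rows` is a list of dicts, one per spreadsheet row. Returns the
    project IDs, the "needs work" labels (as ints), and the remaining
    numeric feature columns.
    """
    ids = [row["Project ID"] for row in rows]
    labels = [int(row["needs work"]) for row in rows]
    features = [
        {name: float(value) for name, value in row.items()
         if name not in ("Project ID", "needs work")}
        for row in rows
    ]
    return ids, labels, features
```

Keeping the IDs and labels in parallel lists makes it easy to append them back on later, as Part 1 asks.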

Part 1

Standardize the remainder of the data (after ignoring the above two columns). Submit your standardized data, with "Project ID" and "needs work" columns appended back on, in some easy-to-read format, along with the code that you have used to generate it.

Part 2

Write a program that loops over each item in the standardized dataset and finds its nearest neighbors for k = 1, 3, 5, 7, and a few other values based on your intuition. Make sure not to accidentally include the item itself as one of its nearest neighbors! Pretend that you don't know the correct classification of the item (needs work or not), and have the neighbors vote on what they think the classification should be. Compare the classification you estimate to the actual classification, and determine the percentage that you actually get correct.
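The loop above is a leave-one-out evaluation. A sketch, assuming the standardized features are plain lists of floats and using Euclidean distance; the name knn_accuracy is illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_accuracy(features, labels, k):
    """Leave-one-out k-NN: classify each item by a majority vote of its
    k nearest neighbors, never counting the item itself, and return the
    fraction classified correctly."""
    correct = 0
    for i, item in enumerate(features):
        # Distances to every *other* item (j != i excludes the item itself)
        dists = [(euclidean(item, other), labels[j])
                 for j, other in enumerate(features) if j != i]
        dists.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in dists[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted == labels[i]:
            correct += 1
    return correct / len(features)
```

Note that with even values of k a vote can tie; you'll need to pick a tiebreaking rule (odd k sidesteps the issue, which is one reason 1, 3, 5, 7 are the suggested starting points).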

Try your technique for both Euclidean distance and Manhattan distance, and plot your results for each for a variety of values of k.
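The two metrics differ only in how coordinate differences are combined, so switching between them is a one-function change:

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance: square root of summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block (L1) distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))
```

Passing the distance function itself as a parameter to your k-NN routine keeps the rest of the code identical for both experiments.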

In order to assist the graders, your program should also print out the predicted classification (the results of the vote) for each project ID. Make it easy for the graders to pick a value of k and a distance metric (Euclidean or Manhattan distance) when they run your program.
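One convenient way to let the graders pick k and the metric is command-line flags; a sketch using the standard library's argparse (flag names are just a suggestion):

```python
import argparse

def build_parser():
    """Command-line options so graders can choose k and the distance metric."""
    parser = argparse.ArgumentParser(
        description="k-NN classifier for the writing-portfolio dataset")
    parser.add_argument("--k", type=int, default=5,
                        help="number of neighbors that vote (default: 5)")
    parser.add_argument("--metric", choices=["euclidean", "manhattan"],
                        default="euclidean",
                        help="distance metric (default: euclidean)")
    return parser
```

A grader could then run, say, `python knn.py --k 7 --metric manhattan` without touching your code.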

I haven't tried this technique myself on this particular dataset, so I'm curious as to what you'll get. On one hand, it would be amazingly cool if you can get pretty good classification rates. On the other hand, if you can succeed at this with high accuracy, it might have scary implications regarding the relevance of the entire writing portfolio exercise. What will you find?

In addition to submitting your code, you should turn in the plots that you generate, as well as a short description of your results.