CS 324: k-Nearest Neighbor Classifier

One of the main ways that nearest neighbors are used is for a k-NN (k-nearest neighbor) classifier, which attempts to classify an item based on the classifications of neighbors around it.

The dataset that you'll use for this assignment is the letter-recognition dataset from the UCI Machine Learning Archive.

Part 1

Standardize the data, ignoring the first column. Submit your standardized data, with the first column appended back on in some easy-to-read format, along with the code that you have used to generate it.
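As a rough sketch only (you are not required to structure your code this way), standardization might look like the following, assuming "standardize" means the usual z-score (subtract each column's mean, divide by its standard deviation) and that the raw file is comma-separated with the class letter in the first column. The filenames here are just placeholders.

# Sketch of z-score standardization for the letter-recognition data.
# The filenames are placeholders; use whatever names suit your setup.
import csv, math

rows = []
with open('letter-recognition.data') as f:
    for line in csv.reader(f):
        label = line[0]                       # the class letter
        values = [float(v) for v in line[1:]]  # the numeric attributes
        rows.append((label, values))

n_attrs = len(rows[0][1])
n = float(len(rows))

# Per-column mean and standard deviation.
means = [sum(r[1][j] for r in rows) / n for j in range(n_attrs)]
stds = [math.sqrt(sum((r[1][j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_attrs)]

# Write out the standardized attributes with the label appended back on.
with open('letter-recognition-standardized.csv', 'w') as out:
    for label, values in rows:
        z = [(values[j] - means[j]) / stds[j] for j in range(n_attrs)]
        out.write(label + ',' + ','.join('%.6f' % v for v in z) + '\n')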

Submit your work via the Carleton Courses hand-in folder for this course, in a folder called knn1.

Part 2

Write a program to loop over each item in the dataset, using the above standardized version, and for each find its nearest neighbors for k = 1 through 7, plus a few other values of k based on your intuition. Make sure not to accidentally include the item itself as one of its nearest neighbors! Pretend that you don't know the correct classification of the item, and have the neighbors vote on what they think the classification should be.

Your program should take the value of k as a command line parameter, and when run, report the classification accuracy over the entire dataset.

Clarification added on 1/9/15, 4pm: For a multiclass problem, there are many different ways you can think about computing accuracy. One of them is simply the number of items that you classify correctly, divided by the total number of items. That's the one you should use for this assignment, computed via leave-one-out cross-validation. (A disadvantage of this approach is that it underemphasizes rare classes. That's not really a problem for this dataset, which has roughly the same number of examples for each class.)
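To make the leave-one-out loop described above concrete, here is one possible sketch. The names `data`, `labels`, and `distance` are placeholders for whatever representation and metric you choose, and the tie handling here is deliberately naive (see the tie discussion below).

# Sketch of leave-one-out k-NN classification and accuracy.
# `data` is a list of standardized attribute vectors, `labels` the
# corresponding class letters, and `distance` is whichever metric
# (Euclidean or Manhattan) was selected on the command line.
from collections import Counter

def classify(i, data, labels, k, distance):
    # Distances from item i to every other item, excluding i itself.
    dists = [(distance(data[i], data[j]), labels[j])
             for j in range(len(data)) if j != i]
    dists.sort()
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]   # naive: ties broken arbitrarily

def accuracy(data, labels, k, distance):
    correct = sum(1 for i in range(len(data))
                  if classify(i, data, labels, k, distance) == labels[i])
    return correct / float(len(data))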

Your program should also allow for a choice of using Euclidean or Manhattan distance, again via a command line parameter. Here is a sample of how we will test your program when grading, for Python or Java:

pypy knn.py 3 euclidean
python knn.py 3 euclidean
javac Knn.java && java Knn 6 manhattan

The above shows three invocations: the first in Python via pypy, the second in plain Python, and the third in Java. Each uses a different k and distance metric. If you use a language other than Python or Java, it must run on the department lab computers, and you should provide careful instructions on how to run it. If you write in Python, my suggestion is to use pypy, as your code will run considerably faster. It shouldn't matter whether we use python or pypy to test your code apart from speed, but there may be subtle differences. Indicate in a comment at the top of the program which environment (python or pypy) you have been using. Note that you may want to avoid Python 3; later in the term, we'll be using the numpy and scipy libraries, and while they work great with pypy for Python 2.x, they're not ready for pypy for Python 3.x.
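One way the command line interface and the two distance functions might be wired up, consistent with the sample invocations above (the structure is only a suggestion):

# Sketch of the command line interface matching the sample invocations,
# e.g. `pypy knn.py 3 euclidean`.
import sys
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

if __name__ == '__main__':
    k = int(sys.argv[1])
    metric_name = sys.argv[2].lower()
    distance = euclidean if metric_name == 'euclidean' else manhattan
    # ... load the standardized data, run leave-one-out classification,
    # and print the overall accuracy for this k and metric ...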

In addition to submitting your code, you should turn in a plot for Euclidean distance showing k on the horizontal axis and average accuracy on the vertical axis. Do the same for Manhattan distance. Explain in a very short attached writeup which distance metric is better (if either is), and what value(s) of k are preferred. Also briefly comment on how successful your classifier ultimately was.

You'll need to handle ties, both when more than one class wins the vote and when distance ties prevent you from picking exactly k neighbors. Choose an appropriate strategy, and indicate in your writeup how you handled these cases in your code.
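One possible strategy, not the only acceptable one: when distance ties would push you past k neighbors, include every item tied with the k-th distance; when two or more classes tie in the vote, fall back to the class of the closest neighbor among the tied classes. A sketch:

# Sketch of one tie-handling strategy; `dists` is the sorted list of
# (distance, label) pairs produced when classifying a single item.
def vote_with_ties(dists, k):
    # Distance ties: keep every neighbor whose distance equals the k-th one.
    cutoff = dists[k - 1][0]
    neighbors = [label for d, label in dists if d <= cutoff]
    counts = {}
    for label in neighbors:
        counts[label] = counts.get(label, 0) + 1
    best = max(counts.values())
    winners = [label for label, c in counts.items() if c == best]
    if len(winners) == 1:
        return winners[0]
    # Vote ties: fall back to the closest neighbor among the tied classes.
    for d, label in dists:
        if label in winners:
            return label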

I haven't tried this technique myself on this particular dataset, so I'm curious as to what you'll get.

Submit your work via the Carleton Courses hand-in folder for this course, in a folder called knn2.