CS 324: Clustering

For this assignment, you will program the k-means clustering algorithm and try it on a dataset consisting of Wikipedia editing history.

This file contains tab-delimited editing history for 3030 randomly selected English Wikipedia editors. Specifically, it contains the number of edits that each editor made in the year 2009 in each of four Wikipedia namespaces: "main" (actual articles), "talk" (talk about articles), "user" (user pages), and "user talk" (talk between users).
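A minimal loading sketch in Python is below. The filename wiki_edits.txt is just a placeholder (use whatever the data file is actually called), and it assumes each row holds only the four namespace counts; adjust it if the file has an id column or header.

    import numpy as np

    def load_edit_counts(path="wiki_edits.txt"):
        # np.loadtxt handles tab-delimited numeric data; use the usecols argument
        # to skip a leading editor-id column if the file has one.
        return np.loadtxt(path, delimiter="\t")

    # data = load_edit_counts()   # shape (3030, 4) if only the four counts are present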

Part 1

Your assignment is to implement the k-means algorithm and run it on this dataset. Here are some specific things that you should and shouldn't do. There are some questions within, indicated with underlines; turn in your answers to these in a separate file.

  1. One challenge with this data is that it is really, really, really skewed. The number of edits in each category more closely resembles a power-law distribution than anything normal or symmetric. In other words, there are tons of editors with 0 or just a few edits, and then a small number of editors with very large numbers of edits. One way to see this is to sort the number of edits in main and plot them (you can do this in R): you'll see mostly small values, with a tail that shoots way up. This distribution is problematic because a relatively small number of points ends up pulling the cluster centers around dramatically and unevenly. It also isn't fair to give such prolific editors that much weight; the difference between 10 edits and 20 edits should be considered much bigger than the difference between 1000 edits and 1020 edits.

    Standardizing the data might seem like a technique we have covered that could handle this. But it's a terrible idea in this case. Explain why.

    A different technique than standardization that is commonly used in cases such as this one is to log transform the data. In other words, take your entire dataset, and take the log of all points within. The base of the log doesn't matter; the old "change of base" formula shows that different logarithm bases simply multiply the result by a scalar. But to keep us all consistent, use base 2 for your logarithm. One detail here is that before you take the log of all of your values, you need to add 1 to everything; otherwise, you're taking log(0), which is undefined.

    So to summarize the above: before clustering, replace every value x in your data with log_2(1+x). After you're done clustering, convert the values back to what they should be (and I'll let you work out the reverse algebra for that).
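    As a rough sketch of that transform pair (the function names are just placeholders, and this assumes your data sits in a NumPy array):

        import numpy as np

        def log_transform(data):
            # Replace every value x with log_2(1 + x); the +1 avoids log(0).
            return np.log2(1.0 + data)

        def inverse_log_transform(data):
            # Undo the transform: y = log_2(1 + x)  =>  x = 2**y - 1.
            return np.power(2.0, data) - 1.0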

  2. You'll need to initialize the cluster centers. Do something appropriate: implement both choosing k random points from the dataset and something fancier. Allow the user to choose which technique to use when starting the program.
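    One way to organize this is sketched below; k-means++ is offered only as an example of a "fancier" option, not the required one.

        import numpy as np

        def init_random(data, k, seed=None):
            # Pick k distinct rows of the dataset as the starting centers.
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(data), size=k, replace=False)
            return data[idx].copy()

        def init_kmeanspp(data, k, seed=None):
            # One possible "fancier" scheme (k-means++): pick each new center
            # with probability proportional to its squared distance from the
            # nearest center chosen so far.
            rng = np.random.default_rng(seed)
            centers = [data[rng.integers(len(data))]]
            for _ in range(k - 1):
                diffs = data[:, None, :] - np.array(centers)[None, :, :]
                d2 = (diffs ** 2).sum(axis=2).min(axis=1)
                centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
            return np.array(centers)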

  3. You should appropriately handle empty clusters. There's a wonderfully neat and simple trick that handles this really well (a rough sketch appears after the question below). At the end of each pass through your data, after you have assigned points to the nearest cluster center, check to see if any clusters are empty. If you have n empty clusters, find the n points in the dataset that are furthest from their cluster centers. Throw away the empty clusters' centers and replace them with these points. Reassign points to their closest centers. Repeat if necessary.

    Prove that this procedure will reduce clustering error.
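    A sketch of that replacement step (assuming NumPy arrays for the data, centers, and assignments; after it runs, reassign points and repeat if new clusters become empty):

        import numpy as np

        def fix_empty_clusters(data, centers, assignments):
            # Replace each empty cluster's center with one of the points that
            # are currently furthest from their assigned centers.
            dists = ((data - centers[assignments]) ** 2).sum(axis=1)
            empty = [j for j in range(len(centers)) if not np.any(assignments == j)]
            if empty:
                furthest = np.argsort(dists)[::-1][:len(empty)]
                for j, i in zip(empty, furthest):
                    centers[j] = data[i]
            return centers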

  4. When your program runs, make sure it prints out the total SSE at each iteration, both before and after picking new cluster centers. This is useful for you and the graders when diagnosing whether things are working correctly.
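    For instance, the assignment and SSE steps might look like the sketch below (using squared Euclidean distances, per item 7); the exact structure of your main loop is up to you.

        import numpy as np

        def assign(data, centers):
            # Squared Euclidean distance from every point to every center,
            # then take the closest center for each point.
            d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            return np.argmin(d2, axis=1)

        def sse(data, centers, assignments):
            # Total squared distance from each point to its assigned center.
            return float(((data - centers[assignments]) ** 2).sum())

        # A typical iteration might report:
        #   assignments = assign(data, centers)
        #   print("SSE after assigning points:", sse(data, centers, assignments))
        #   ...recompute each center as the mean of its cluster (handling empty
        #      clusters as in item 3)...
        #   print("SSE after recomputing centers:", sse(data, centers, assignments))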

  5. Run your code for values of k from 1 to 20. Plot the total clustering error vs. k. Use both this plot and the cluster centers that you find to explain what you think the "best" value of k should be.
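    The plotting loop could be as simple as the sketch below; it assumes some kmeans(data, k) function that returns the final SSE as its third result, which is only a placeholder for whatever your implementation provides.

        import matplotlib.pyplot as plt

        def plot_error_vs_k(data, kmeans, max_k=20):
            ks = list(range(1, max_k + 1))
            errors = [kmeans(data, k)[2] for k in ks]
            plt.plot(ks, errors, marker="o")
            plt.xlabel("k")
            plt.ylabel("total clustering error (SSE)")
            plt.savefig("error_vs_k.png")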

  6. Specifically for the choice of k=5, obtain the cluster centers. Provide these centers in your writeup, and interpret them. (This is to give the grader a fighting chance at trying to grade something specific.)

  7. Use Euclidean squared as your distance metric.

Part 2

Your results for Part 1 depend heavily on the magnitude of the data. One might also be interested in clustering results based on how an editor's work is distributed across namespaces, relative to their overall amount of work. Therefore, redo Part 1, doing everything exactly as stated above, but with the following modifications.

  1. Normalize each row by dividing it by the Euclidean magnitude of the row (the square root of the sum of the squares of each value in the row). This has the effect of making the row (when thought of as a vector) have a Euclidean magnitude of 1. It also has the effect of giving it a squared Euclidean magnitude of 1, by the way.
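    A minimal sketch of that normalization (assuming a NumPy array; how you treat an all-zero row, which has no well-defined direction, is up to you):

        import numpy as np

        def normalize_rows(data):
            # Divide each row by its Euclidean (L2) norm so the row has magnitude 1.
            norms = np.linalg.norm(data, axis=1, keepdims=True)
            norms[norms == 0] = 1.0  # guard against all-zero rows (division by zero)
            return data / norms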

  2. Do not log transform your data this time. Since the magnitude of the data is limited and all editors are normalized, this is no longer necessary.

  3. Euclidean squared is still your distance metric.

Part 3: (Added challenge, worth 1 additional point)

One unsettling thing that you should discover from Part 2 is that your cluster centers will not have magnitudes precisely equal to 1, despite the fact that all of the points in your dataset do. k-means on normalized data does not ensure that the cluster centers themselves come out normalized. That's sad. If you want to know more about this problem, and how to fix it, my student co-authors and I talk about it in this paper.

To obtain this challenge point, implement a version of your clustering code (as a separate program) that solves this problem.
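One common way to address this (not necessarily the approach taken in the paper) is to rescale each recomputed center back to unit length, in the spirit of spherical k-means. A rough sketch of that update step:

    import numpy as np

    def update_centers_unit_norm(data, assignments, k):
        # Recompute each center as the mean of its cluster, then rescale it to
        # unit length so the centers lie on the same unit sphere as the data.
        centers = np.zeros((k, data.shape[1]))
        for j in range(k):
            members = data[assignments == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
            # Empty clusters are left as zero vectors here; handle them with the
            # same trick as in Part 1, item 3.
        return centers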