CS 324: Agglomerative Clustering

Note to self: If you give this assignment again, make sure that students are aware that if they want it to be efficient, they need to code up their own heap. You use a heap for maintaining distances between all pairs of clusters. When you merge two clusters, you need to be able to find all find all others instances of those clusters in the heap to adjust their entries accordingly. The problem is that it takes O(n) to search through the heap for that, which defeats the whole purpose of using the heap. But if you implement the heap yourself, you can also keep a separate dictionary of pairs that map to the index in the heap where the value is. As you update the heap and move things around, update the dictionary.

For this clustering assignment, you will code up a version of agglomerative clustering.

This file contains tab delimited data of a version of the writing portfolio data.

Your assignment is to implement agglomerative clustering, and run it on this dataset. Here are some specific things that you should and shouldn't do.

  1. To measure the distance between two clusters, use the Euclidean-squared distance between the centroids of those clusters (i.e., the mean). Handle a single point as a cluster of one point.

  2. The process of determining which clusters to merge next can be done via looking at all pairs every time, or via a smart process. The last part of section 8.3.1 in the Tan textbook discusses this, as does section 7.2.2 in the Mining Massive Datasets book. You can get most of the credit for this assignment if you want to avoid this whole efficiency approach and do a simpler search for nearest clusters, but some small fraction of the grading points will be reserved for this. Make sure to document very clearly what you're doing so that the graders can determine whether or not you're attempting a more efficient process than linear search.

  3. The column "HS Rank" has a mysterious value of 9999.99 if it is unknown. Make sure that you appropriately treat it, and blank values elsewhere in the dataset, as missing.

  4. Make sure that you standardize the data. In doing so, ignore all missing data to calculate the mean and standard deviation, then fill missing data in with zeros after standardizing.

  5. Plot the total SSE vs. number of clusters, for k varying from 1 to 20. In the writeup that you submit with your assignment, use both this plot and the cluster centers that you find to explain what you think the "best" value of k should be.

  6. Specifically for the choice of k=5, obtain the cluster centers. Provide these centers in your writeup, and interpret them. (This is to give the grader a fighting chance at trying to grade a consistently defined problem.)