CS 324: Agglomerative Clustering

For this assignment, you will code up a version of agglomerative clustering.

This file contains tab-delimited data: a version of the writing portfolio dataset that we looked at early in the term.

Your assignment is to implement agglomerative clustering, and run it on this dataset. Here are some specific things that you should and shouldn't do.

  1. To measure the distance between two clusters, use the Euclidean-squared distance between the centroids of those clusters (i.e., the means of the points in each cluster). Treat a single point as a cluster containing one point, so the same distance applies throughout. (See the distance sketch after this list.)

  2. The process of determining which clusters to merge next can be done by examining all pairs every time, or by a smarter process. The last part of section 8.3.1 in the Tan textbook discusses this, as does section 7.2.2 in the Mining Massive Datasets book. You can earn most of the credit for this assignment by avoiding the efficiency approach entirely and doing a simple linear search for the nearest clusters, but a small fraction of the grading points is reserved for the efficient version. Document very clearly what you're doing so that the graders can determine whether or not you're attempting a process more efficient than linear search. See the notes at the bottom of this page for more details if you choose to pursue it.

  3. The column "HS Rank" has a mysterious value of 9999.99 when it is unknown. Make sure that you treat it, along with blank values elsewhere in the dataset, as missing. (See the preprocessing sketch after this list.)

  4. Make sure that you standardize the data. In doing so, ignore all missing values when calculating each column's mean and standard deviation, then fill the missing entries with zeros after standardizing. (The preprocessing sketch after this list covers this step as well.)

  5. Plot the total SSE vs. the number of clusters, for k varying from 1 to 20. In the writeup that you submit with your assignment, use both this plot and the cluster centers that you find to explain what you think the "best" value of k should be. (See the SSE sketch after this list.)

  6. Specifically for the choice of k=5, report the cluster centers. Provide these centers in your writeup and interpret them. (This is to give the grader a fighting chance at grading a consistently defined problem.)
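
Here is the distance sketch referenced in item 1: a minimal Python illustration using NumPy, assuming clusters are represented as lists of row indices into a standardized data matrix X. That representation is my choice for the sketch, not a requirement of the assignment.

    import numpy as np

    def centroid(X, cluster):
        # Mean of the rows of X belonging to this cluster. A single
        # point is just a cluster holding one index.
        return X[cluster].mean(axis=0)

    def cluster_distance(X, a, b):
        # Euclidean-squared distance between the centroids of a and b.
        diff = centroid(X, a) - centroid(X, b)
        return float(diff @ diff)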
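
Here is the preprocessing sketch referenced in items 3 and 4, using pandas. It assumes the data file is named portfolio.txt and that every column is numeric; the actual file name and columns may differ, so adapt as needed.

    import numpy as np
    import pandas as pd

    # Blank fields in the tab-delimited file become NaN automatically.
    df = pd.read_csv("portfolio.txt", sep="\t")

    # 9999.99 is a sentinel for an unknown "HS Rank"; treat it as missing.
    df["HS Rank"] = df["HS Rank"].replace(9999.99, np.nan)

    # pandas ignores NaNs when computing mean/std, which is exactly what
    # item 4 asks for; fill the leftover NaNs with zeros as the last step.
    X = ((df - df.mean()) / df.std()).fillna(0.0).to_numpy()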
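
Here is the SSE sketch referenced in item 5. Total SSE for a clustering is the sum, over all clusters, of the squared distances from each point to its cluster's centroid. The sketch builds on the representation above; sse_by_k is a hypothetical dictionary you would fill in as your merge loop passes down through each k from 20 to 1.

    def total_sse(X, clusters):
        # Sum of squared Euclidean distances from each point to the
        # centroid of its own cluster.
        sse = 0.0
        for cluster in clusters:
            diffs = X[cluster] - centroid(X, cluster)
            sse += float((diffs ** 2).sum())
        return sse

    # Record total_sse(X, clusters) in the hypothetical sse_by_k dict
    # whenever the merge loop reaches k clusters, then plot:
    import matplotlib.pyplot as plt
    ks = range(1, 21)
    plt.plot(ks, [sse_by_k[k] for k in ks], marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("total SSE")
    plt.show()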

Further notes on making the process efficient

If you want to pursue those last few points by efficiently finding the closest clusters, you'll want to use a priority queue / heap to maintain the distances between all pairs of clusters. You then merge the pair at the top of the heap, i.e., the pair with the smallest distance between them. When you merge two clusters, you need to be able to find every other heap entry that involves either of those clusters so you can adjust those entries accordingly.

The catch is that it naively takes O(n) work to search through the heap for all pairs that contain one of the clusters being merged, which defeats the whole purpose of using a heap. But if you implement the heap yourself (instead of using a built-in implementation), you can also keep a separate dictionary that maps each cluster pair to the index in the heap where that pair's entry currently sits. As you update the heap and move entries around, update the dictionary as well, so any entry can be located in constant time. A sketch follows below.
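
To make this concrete, here is a minimal sketch of such a hand-rolled heap in Python. The class name IndexedHeap and the choice to key the dictionary by cluster-id pairs are mine, not part of the assignment; adapt freely.

    class IndexedHeap:
        # Min-heap of (distance, pair) entries, plus a dictionary mapping
        # each pair to its current index in the heap array, so any entry
        # can be found, removed, or updated in O(log n) instead of O(n).
        # Pairs should be canonical, e.g. (i, j) tuples with i < j.

        def __init__(self):
            self.heap = []   # list of (distance, pair) tuples
            self.pos = {}    # pair -> index into self.heap

        def _swap(self, i, j):
            self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
            self.pos[self.heap[i][1]] = i
            self.pos[self.heap[j][1]] = j

        def _sift_up(self, i):
            while i > 0:
                parent = (i - 1) // 2
                if self.heap[i][0] < self.heap[parent][0]:
                    self._swap(i, parent)
                    i = parent
                else:
                    break

        def _sift_down(self, i):
            n = len(self.heap)
            while True:
                smallest = i
                for child in (2 * i + 1, 2 * i + 2):
                    if child < n and self.heap[child][0] < self.heap[smallest][0]:
                        smallest = child
                if smallest == i:
                    break
                self._swap(i, smallest)
                i = smallest

        def push(self, dist, pair):
            self.heap.append((dist, pair))
            self.pos[pair] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)

        def pop_min(self):
            # Remove and return the entry with the smallest distance.
            top = self.heap[0]
            self.remove(top[1])
            return top

        def remove(self, pair):
            # Remove the entry for `pair` in O(log n): look up its index,
            # swap in the last element, and restore the heap property.
            i = self.pos.pop(pair)
            last = self.heap.pop()
            if i < len(self.heap):
                self.heap[i] = last
                self.pos[last[1]] = i
                self._sift_up(i)
                self._sift_down(i)

In the merge loop, you would call pop_min() to get the closest pair (a, b), call remove() on every surviving pair involving a or b (you can enumerate these directly from your set of active clusters), and then push() the distances from the newly merged cluster to each remaining cluster.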