Machine Learning and Data Mining Assignment: Agglomerative Clustering

This time around, you will code up a version of agglomerative clustering. The approach is inspired by, but simpler than, the CURE algorithm.

You should use the same datasets that you used on the last assignment. Your implementation of agglomerative clustering should have the following characteristics:
- To make this efficient, use a heap, as the CURE paper does, to keep track of which pair of clusters to merge next (a sketch follows below).
- The k-d tree that CURE uses is not necessary here, because we will not decide which points are representative and which are not. Instead, we keep all points ("all points are representative"), so the distance between two clusters is the minimum distance between any pair of their points.
- As a consequence, when you merge two clusters, any other cluster that had either of them as its previous nearest will have the new merged cluster as its nearest.
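
To make the structure concrete, here is a minimal Python sketch of the idea, not a required implementation. It assumes Euclidean distance, treats the cluster distance as the minimum distance between any pair of points, and uses Python's heapq with lazy deletion of stale entries, since heapq has no decrease-key operation. The names agglomerate and cluster_dist are my own, for illustration only.

    import heapq
    import math

    def cluster_dist(a, b):
        # "All points are representative": the distance between two clusters
        # is the smallest Euclidean distance between any pair of their points.
        return min(math.dist(p, q) for p in a for q in b)

    def agglomerate(points, k):
        # Start with every point in its own cluster, keyed by an integer id.
        clusters = {i: [p] for i, p in enumerate(points)}
        next_id = len(points)

        # Seed the heap with every pairwise cluster distance.
        heap = []
        ids = list(clusters)
        for pos, i in enumerate(ids):
            for j in ids[pos + 1:]:
                heapq.heappush(heap, (cluster_dist(clusters[i], clusters[j]), i, j))

        # Merge the closest surviving pair until only k clusters remain.
        while len(clusters) > k and heap:
            d, i, j = heapq.heappop(heap)
            if i not in clusters or j not in clusters:
                continue  # stale entry: one of these clusters was merged already
            merged = clusters.pop(i) + clusters.pop(j)
            # Push distances from the new cluster to every surviving cluster;
            # old entries mentioning i or j are skipped lazily when popped.
            for other, pts in clusters.items():
                heapq.heappush(heap, (cluster_dist(merged, pts), other, next_id))
            clusters[next_id] = merged
            next_id += 1

        return list(clusters.values())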

I would encourage you not to code up your own heap, but rather to use an existing one; any of the languages that you are using should have a suitable library.
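
For example, Python's standard library provides heapq, a binary heap over plain lists. A common pattern, and the one assumed in the sketch above, is to push (distance, cluster_id_a, cluster_id_b) tuples and simply skip entries whose clusters have already been merged, since heapq cannot update priorities in place.

    import heapq

    # Entries are (distance, id_a, id_b); the smallest distance pops first.
    heap = []
    heapq.heappush(heap, (2.5, 0, 3))
    heapq.heappush(heap, (0.7, 1, 2))
    heapq.heappush(heap, (1.9, 0, 1))

    print(heapq.heappop(heap))  # (0.7, 1, 2): the closest pair of clusters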

As last time, you should vary the number of final clusters and choose an appropriate number. Turn in on paper an explanation of the methodology you used, the results you found, and a clear description of the final clusters you discovered. Turn in your code via hsp.
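
As a hypothetical illustration of that experiment (the toy points and the agglomerate function from the sketch above are mine, not part of the assignment), you might rerun the clustering for several final cluster counts and compare the resulting cluster sizes before settling on one:

    # Replace the toy points below with the datasets from the last assignment.
    points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.2), (10.0, 0.0)]

    for k in (2, 3, 4):
        result = agglomerate(points, k)
        print(k, sorted(len(c) for c in result))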