Clustering

Computer Science Comps Project

Netflix Prize

About Clustering:
Clustering is a type of unsupervised learning concerned with partitioning data into subsets. It is useful in a recommendation context because it identifies prototypical movies and can be used to identify noise.

Algorithms:
We implemented two clustering algorithms: k-means and DBSCAN.

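K-means is a prototype-based clustering method. It randomly chooses initial centroids, then iteratively improves them by assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. Because the assignment step treats every point independently, k-means can be implemented in parallel, so we distributed it across many machines.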
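The iteration described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the project's distributed implementation. The function names, the `iterations` and `seed` parameters, and the use of squared Euclidean distance are all assumptions made for the example. In the recommendation setting each point would presumably be a movie's rating vector, but the source does not say how movies were represented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` into `k` groups; returns (centroids, assignment).

    `iterations` and `seed` are illustrative parameters, not from the source.
    """
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        # Every point is handled independently, so this step parallelizes.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

One plausible way to distribute this is to have each machine run the assignment step on its own share of the points and report per-cluster sums and counts, which are then combined for the update step. The source does not describe how the work was actually split.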

DBSCAN is a density-based clustering method. It forms clusters by building mutually density-connected subsets of points. Because DBSCAN reduces to kNN queries, we calculated the kNN results offline so that the clustering itself runs in linear time.
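To illustrate how precomputed neighbor lists keep the clustering pass cheap, here is a minimal Python sketch of DBSCAN that does no distance computation at all. It assumes a hypothetical `neighbors` dictionary mapping each point to the points in its neighborhood, computed offline. The names `neighbors`, `min_pts`, and `NOISE` are chosen for the example, and the source does not say how the kNN results were turned into neighborhoods.

```python
from collections import deque

NOISE = -1  # label for points that belong to no cluster


def dbscan(neighbors, min_pts):
    """Label points given precomputed neighborhoods.

    neighbors: dict mapping each point to a list of nearby points
               (hypothetical input, assumed to be computed offline).
    min_pts:   minimum neighborhood size, counting the point itself,
               for a point to count as a core point.
    Returns a dict mapping each point to a cluster id or NOISE.
    """
    labels = {}
    cluster_id = 0
    for point in neighbors:
        if point in labels:
            continue
        if len(neighbors[point]) + 1 < min_pts:
            labels[point] = NOISE  # may be relabeled as a border point later
            continue
        # `point` is a core point: grow a new cluster outward from it.
        labels[point] = cluster_id
        queue = deque(neighbors[point])
        while queue:
            q = queue.popleft()
            if labels.get(q) == NOISE:
                labels[q] = cluster_id  # border point reached from a core point
            if q in labels:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) + 1 >= min_pts:
                queue.extend(neighbors[q])  # q is also core, so keep expanding
        cluster_id += 1
    return labels


if __name__ == "__main__":
    toy = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3, 5], 5: [4], 6: []}
    print(dbscan(toy, min_pts=3))
```

Each core point's neighbor list is scanned once, so the running time is proportional to the total length of the lists. With a fixed number of neighbors per point, that is linear in the number of points.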

Cluster Results:
Here are some sample clusters.

<“Billy Madison”, “Happy Gilmore”>

<“Star Wars V”, “LOTR: RotK”, “LOTR: FotR”, “The Silence of the Lambs”, “Shrek”, “Caddyshack”, “Pulp Fiction”, “Full Metal Jacket”>

<“Star Wars II”, “Men In Black II”, “What Women Want”>

<“Family Guy: Vol 1”, “Family Guy: Freakin’ Sweet Collection”, “Futurama: Vol 1 – 4”>

<“2002 Olympic Figure Skating Competition”, “UFC 50: Ultimate Fighting Championship: The War of '04”>

<“Scorpions: A Savage Crazy World”, “Metallica: Cliff 'Em All”, “Iron Maiden: Rock in Rio”, “Classic Albums: Judas Priest: British Steel”>

<“Blue Collar Comedy Tour: The Movie”, “Jeff Foxworthy: Totally Committed”, “Bill Engvall: Here's Your Sign”, “Larry the Cable Guy: Git-R-Done”>

<“Beware! The Blob”, “They crawl”, “Aquanoids”, “The dead hate the living”>

<“The Girl who Shagged me”, “Sports Illustrated Swimsuit Edition”, “Sorority Babes in the Slimeball Bowl-O-Rama”, “Forrest Gump: Bonus Material”>