Exam 3 guidelines

This is intended to give you a sense of what I think is important from the course so far, and what I will be thinking of when creating the exam.

I hate disclaimers, but here are some anyway. This is not a contract. I may have inadvertently left something off this list that ends up in an exam question. I make no guarantees that the exam will be 100% limited to items listed below. Moreover, I will not be able to test all of this material given the time limitations of the exam. I will have to pick and choose some subset of it.

BRING A CALCULATOR WITH YOU.

You are permitted one 8.5 x 11 sheet of paper with notes (both sides) for use as a reference during the exam.

Here are the specifics: Students should be able to...

Distinguish between supervised and unsupervised learning. Assess which is appropriate for a particular situation and/or dataset.

Demonstrate detailed understanding of the clustering problem. Describe and interpret the goals of the clustering problem, both practical and mathematical.

Distinguish between the clustering problem and the algorithms used to solve it. Describe whether a particular algorithm solves the problem directly or solves only an approximation to it.

Describe k-means in detail, or apply k-means to a toy problem. Evaluate under what circumstances k-means is guaranteed to reduce (or at least not increase) error at each iteration. Connect different distance metrics with different variations on the algorithm. Evaluate the impact of different tricks for picking initial seeds, and use and interpret techniques such as refined cluster centers and k-means++.
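
For practice, here is a minimal sketch of k-means on a one-dimensional toy problem. It assumes squared Euclidean distance with the mean as the cluster center, which is the setting where neither the assignment step nor the update step can increase the sum of squared errors. The data, the starting centers, and the function names are made up for illustration and are not taken from the course materials.

```python
def assign(points, centers):
    # Assignment step: each point goes to its nearest center (squared distance).
    return [min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            for p in points]

def update(points, labels, centers):
    # Update step: each center moves to the mean of its members.
    # An empty cluster keeps its old center (one possible convention).
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        new_centers.append(sum(members) / len(members) if members else old)
    return new_centers

def sse(points, centers, labels):
    # Sum of squared errors of each point to its assigned center.
    return sum((p - centers[l]) ** 2 for p, l in zip(points, labels))

def kmeans_1d(points, centers, max_iter=20):
    history = []  # SSE recorded after every update step
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        history.append(sse(points, new_centers, labels))
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels, history

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, labels, history = kmeans_1d(points, [1.0, 2.0])  # deliberately poor seeds
print(centers)  # [2.0, 11.0]
print(labels)   # [0, 0, 0, 1, 1, 1]
print(history)  # SSE per iteration; never increases
```

Try other seeds by hand and compare the final error. A different distance metric calls for a different center update if the same guarantee is to hold; Manhattan distance, for example, pairs with the median rather than the mean.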

Describe bisecting k-means and/or agglomerative clustering in detail, or apply them to a toy problem. Compare them with regular k-means in categories such as error reduction, the effect of randomizing the order of the data, and their connection with the general clustering problem.
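
For practice, here is a minimal sketch of agglomerative clustering on one-dimensional data. It assumes single-link distance (the closest pair of points between two clusters); complete-link or average-link would change only the `single_link` function. The data and names are illustrative assumptions, not the course's own example.

```python
def single_link(a, b):
    # Single-link distance: the closest pair of points across the two clusters.
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, k):
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []                       # record of which clusters were merged, in order
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance.
        # Ties go to the first pair found, so ties are the only place where
        # the order of the data can matter.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

clusters, merges = agglomerative([1.0, 2.0, 4.0, 10.0, 11.0], k=2)
print(clusters)  # [[1.0, 2.0, 4.0], [10.0, 11.0]]
```

Unlike k-means, this procedure uses no random seeds. Once two clusters are merged they are never split again, which is worth keeping in mind when comparing error reduction across the methods.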

Explain challenges in choosing the correct number of clusters. Interpret "knee of the curve" and silhouette.
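
For practice, here is a minimal sketch of the silhouette calculation on a toy problem. It uses the usual definition s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. It assumes at least two clusters and no duplicate points; scoring a singleton cluster as 0 is one common convention. The data and function name are illustrative.

```python
def silhouette(points, labels):
    # Group the points by cluster label.
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(abs(p - q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return scores

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = silhouette(points, [0, 0, 0, 1, 1, 1])  # natural grouping
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # interleaved grouping
print(sum(good) / len(good))  # close to 1: tight, well-separated clusters
print(sum(bad) / len(bad))    # much lower
```

Values near 1 indicate a point sits comfortably in its cluster, values near 0 indicate a point on a boundary, and negative values suggest the point may belong elsewhere.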

Explain challenges and describe approaches in evaluating quality of clusters.

Describe and/or calculate recommender system values using content-based techniques, or collaborative filtering techniques such as user-based similarity or item-based similarity.
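
For practice, here is a minimal sketch of a user-based collaborative filtering prediction. It assumes cosine similarity computed over co-rated items and a similarity-weighted average of the other users' ratings. This is only one common formulation, and the version used in class may differ (for example, Pearson correlation or mean-centered ratings). The users, items, and ratings are invented.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity over the items both users have rated.
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    # Similarity-weighted average of the ratings other users gave the item.
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None  # None: nobody comparable rated the item

ratings = {
    "ann": {"A": 5, "B": 1},          # ann has not rated C
    "bob": {"A": 5, "B": 1, "C": 4},  # same tastes as ann
    "cat": {"A": 1, "B": 5, "C": 2},  # opposite tastes
}
pred = predict(ratings, "ann", "C")
print(round(pred, 2))  # 3.44, pulled toward bob's rating of 4
```

An item-based version follows the same pattern with the roles swapped: compute similarities between items, then average the user's own ratings of similar items.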

Distinguish between standardization and normalization as we have used them in both clustering and recommendation systems, and describe when/where each is appropriate.
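
For practice, here is a minimal sketch contrasting two common rescalings of a single feature. It assumes "standardization" means z-scores (subtract the mean, divide by the standard deviation) and "normalization" means min-max rescaling to the range [0, 1]. These terms are used inconsistently across sources, so check your notes for the definitions used in class; the function names and data here are illustrative only.

```python
from statistics import mean, pstdev

def standardize(xs):
    # z-score: result has mean 0 and (population) standard deviation 1.
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def min_max(xs):
    # Min-max rescaling: smallest value maps to 0, largest to 1.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

xs = [2.0, 4.0, 6.0]
z = standardize(xs)
scaled = min_max(xs)
print(z)       # symmetric around 0
print(scaled)  # [0.0, 0.5, 1.0]
```

Both functions assume the values are not all identical, since otherwise the denominator is zero.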


One last word: a fine way to practice is to do problems in the textbook that we haven't done. I wouldn't be surprised if I looked there for inspiration in writing test questions.