Machine Learning and Data Mining: Exam 1

Each problem is worth 10 points.

  1. Describe a situation where it would be desirable to have a classifier on the "left" end of an ROC curve, i.e. where both the true-positive rate and the false-positive rate are low.
  2. Suppose that you have a dataset with points in one of two classes. Class A contains 10 points, and class B contains 1000 points. Suppose class A is our class of interest, on which we focus our precision and recall measurements.
    1. Describe a simple classifier that yields a recall of 100%, but has terrible precision. Estimate what the precision will be for this classifier.
    2. Describe a simple classifier that yields a precision of 100%, but has terrible recall. Estimate what the recall will be for this classifier.
  3. Describe how you can modify the k-Nearest Neighbor algorithm, by adding an additional parameter, that would allow you to plot an ROC curve. (Hint: varying k isn't it). Draw a sample ROC curve, and point out what part of the curve corresponds to small values for your parameter, and what part corresponds to large values for your parameter.
  4. Construct an example in 2D consisting of a point P and two Minimum Bounding Rectangles (MBRs) where the MBR with the smallest MINDIST does not contain the closest object to point P. Provide coordinates for the point and the MBRs, and compute MINDIST and MINMAXDIST from the point to each of the two MBRs.
  5. For each of the three pruning criteria in the scalable K-nearest neighbor paper, provide a specific example (points and MBRs as necessary with coordinates) to show why they are valid and how they help speed up the process.
  6. Explain the difference between the k-medians and k-medoids clustering algorithms. What advantages / disadvantages does each offer?
  7. Measuring the success of clustering algorithms is typically more difficult than measuring the success of classification algorithms. Why? Describe two methods that can be used to measure clustering success. For each method, describe a situation where it may not provide desired results.
  8. The second of the "Data Mining Desiderata" in the ScaleKM paper describes three items of information that should be available at any time during the execution of a data mining algorithm. What aspects of these three items of information are supported by ScaleKM and CURE? Explain.
  9. Explain the differences between primary and secondary compression and why both are necessary in the ScaleKM algorithm. Specifically:
    1. Suppose that neither primary nor secondary compression were to occur. How would that impact the performance of ScaleKM?
    2. Suppose that primary compression were done, but secondary compression were ignored. Provide an example of data (using diagrams or coordinates) that shows that undesirable results could occur.
  10. Answer the following questions regarding the CURE algorithm.
    1. What does alpha control?
    2. When alpha = 0, what basic clustering algorithm that we discussed in class does CURE imitate? (hint: neither BIRCH nor MST is the answer)
    3. When alpha = 1, what basic clustering algorithm that we discussed in class does CURE imitate? (hint: neither BIRCH nor MST is the answer)
    4. What does c control?
    5. Assuming alpha = 0, what basic algorithm does CURE imitate when c = 1?
    6. Assuming alpha = 0, what basic algorithm does CURE imitate when c is equal to the number of points in the cluster?
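For problem 2, the precision and recall estimates can be sanity-checked numerically. The sketch below assumes the "label everything as class A" classifier for part 1; the class sizes are the ones given in the problem statement.

```python
# Problem 2, part 1: a classifier that labels every point as class A.
n_A, n_B = 10, 1000  # class sizes from the problem statement

# Predicting class A for all 1010 points:
tp = n_A   # every class-A point is caught
fp = n_B   # every class-B point is mislabeled as A
fn = 0     # no class-A point is missed

recall = tp / (tp + fn)       # 1.0, i.e. 100% recall
precision = tp / (tp + fp)    # 10 / 1010, i.e. roughly 1%

print(f"recall = {recall:.2%}, precision = {precision:.2%}")
```

The symmetric check for part 2 (e.g., a classifier that labels only one sure class-A point as positive) gives precision 1.0 and recall 1/10.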
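For problem 4, the two distance measures can be computed mechanically once the point and MBR coordinates are chosen. The sketch below follows the squared-distance definitions of MINDIST and MINMAXDIST for a point and an axis-aligned MBR; the function names are ours, and the MBR is given by its lower-left and upper-right corners.

```python
def mindist_sq(p, lo, hi):
    """Squared MINDIST: distance from point p to the nearest
    point of the MBR [lo, hi] (0 if p is inside)."""
    total = 0.0
    for pi, si, ti in zip(p, lo, hi):
        if pi < si:
            total += (si - pi) ** 2
        elif pi > ti:
            total += (pi - ti) ** 2
    return total

def minmaxdist_sq(p, lo, hi):
    """Squared MINMAXDIST: an object inside the MBR is guaranteed
    to lie within this distance of p. For each dimension k, take the
    nearer face in dimension k and the farther coordinate in every
    other dimension, then minimize over k."""
    far = [max((pi - si) ** 2, (pi - ti) ** 2)
           for pi, si, ti in zip(p, lo, hi)]
    total_far = sum(far)
    best = float("inf")
    for k in range(len(p)):
        rm = lo[k] if p[k] <= (lo[k] + hi[k]) / 2 else hi[k]
        best = min(best, (p[k] - rm) ** 2 + (total_far - far[k]))
    return best

# Example: point at the origin, MBR [1,2] x [1,2].
print(mindist_sq((0, 0), (1, 1), (2, 2)))     # 2.0 (nearest corner (1,1))
print(minmaxdist_sq((0, 0), (1, 1), (2, 2)))  # 5.0 (far end of nearest face)
```

Running these on a candidate example makes it easy to verify that the MBR with the smaller MINDIST need not contain the true nearest object.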