Data Mining Take Home Exam #1

Assigned on Monday, April 28.
Due in class on Friday, May 2.

Describe a situation where it would be desirable to have a classifier on the "left" end of an ROC curve, i.e. where both true positives and false positives are low.
Suppose that you have a dataset with points in one of two classes. Class A contains 10 points, and class B contains 1000 points. Suppose class A is our class of interest, on which we focus our precision and recall measurements.
(a) Describe a simple classifier that yields a recall of 100%, but has terrible precision. Estimate what the precision will be for this classifier.
(b) Describe a simple classifier that yields a precision of 100%, but has terrible recall. Estimate what the recall will be for this classifier.
Describe how you can modify the k-Nearest Neighbor algorithm, by adding an additional parameter, that would allow you to plot an ROC curve. (Hint: varying k isn't it). Draw a sample ROC curve, and point out what part of the curve corresponds to small values for your parameter, and what part corresponds to large values for your parameter.
Explain the difference between a database query and a data mining query.
For point P = (8,2,5) and MBRs R1(S,T) where S1 = [12,3,1] and T1 = [15,4,2] and R2(S2,T2) where S2 = [1,2,4] and T2 = [7,15,10], determine MINDIST(P,R1) , MINDIST(P,R2), MINMAXDIST(P,R1), and MINMAXDIST(P,R2).
(a) Provide an example where the MINDIST heuristic is better than the MINMAXDIST heuristic.
(b) Provide an example where the MINMAXDIST heuristic is better than the MINDIST heuristic.
In the RainForest paper, Gehrke et. al. say that RF-Read is not efficient for algorithms that do bottom-up pruning, unless the families at the pure leaf nodes are large. Why does the size of the families at the leaf nodes impact whether or not RF-Read is worth considering relative to RF-Write? Likewise, the paper also says that RF-Read might be a viable solution for algorithms that prune the tree top-down. Why does the direction of splitting affect the viability of RF-Read relative to RF-Write?
Estimating the size of a node's AVC-group to be the same as its parent can be overkill. Describe an example.
Suppose that you have reserved enough memory to store 1000 tuples from a dataset on disk that actually contains 5000 tuples and three features. Assume that every AVC-group for this dataset is of the same size. Assume that you have reserved separately a sufficiently large amount of memory to store two AVC-groups simultaneously, but not enough to store three. Assume that every feature in the dataset contains precisely two values, and that each optimal split separates the data into two equally sized pieces (or differ by one if the number of tuples is odd). Give a worst case estimate of how many tuple reads from disk and tuple writes to disk are necessary for each of the algorithms RF-Read, RF-Write, and RF-Hybrid-Simple.
The techniques for agglomerative clustering that Dunham describes in your textbook are different from that the one we covered in class, in that Dunham's algorithms use a threshold. Read section 5.4.1, and discuss how how the threshold differs from our approach. Specifically, give an example of a dataset and a threshold where Dunham's approach would differ from ours, assuming that both algorithms use the same measurements for distances between points and clusters.