Data Mining Take Home Exam #1
Assigned on Monday, April 28.
Due in class on Friday, May 2.
- Describe a situation where it would be desirable to have a
classifier on the "left" end of an ROC curve, i.e. where both true
positives and false positives are low.
- Suppose that you have a dataset with points in one of two
classes. Class A contains 10 points, and class B contains 1000 points.
Suppose class A is our class of interest, on which we focus our
precision and recall measurements.
(a) Describe a simple classifier that yields a recall of 100%, but has
terrible precision. Estimate what the precision will be for this
classifier.
(b) Describe a simple classifier that yields a precision of 100%, but
has terrible recall. Estimate what the recall will be for this
classifier.
- Describe how you can modify the k-Nearest Neighbor algorithm, by
adding an additional parameter, that would allow you to plot an ROC
curve. (Hint: varying k isn't it). Draw a sample ROC curve, and point
out what part of the curve corresponds to small values for your
parameter, and what part corresponds to large values for your parameter.
- Explain the difference between a database query and a data mining
query.
- For point P = (8,2,5) and MBRs R1(S,T) where S1 = [12,3,1] and T1
= [15,4,2] and R2(S2,T2) where S2 = [1,2,4] and T2 = [7,15,10],
determine MINDIST(P,R1) , MINDIST(P,R2), MINMAXDIST(P,R1), and
MINMAXDIST(P,R2).
- (a) Provide an example where the MINDIST heuristic is better than
the MINMAXDIST heuristic.
(b) Provide an example where the MINMAXDIST heuristic is better than
the MINDIST heuristic.
- In the RainForest paper, Gehrke et. al. say that RF-Read is not
efficient for algorithms that do bottom-up pruning, unless the families
at the pure leaf nodes are large. Why does the size of the families at
the leaf nodes impact whether or not RF-Read is worth considering
relative to RF-Write? Likewise, the paper also says that RF-Read might
be a viable solution for algorithms that prune the tree top-down. Why
does the direction of splitting affect the viability of RF-Read
relative to RF-Write?
- Estimating the size of a node's AVC-group to be the same as its
parent can be overkill. Describe an example.
- Suppose that you have reserved enough memory to store 1000 tuples
from a dataset on disk that actually contains 5000 tuples and three
features. Assume that every AVC-group for this dataset is of the same
size. Assume that you have reserved separately a sufficiently large
amount of memory to store two AVC-groups simultaneously, but not enough
to store three. Assume that every feature in the dataset contains
precisely two values, and that each optimal split separates the data
into two equally sized pieces (or differ by one if the number of tuples
is odd). Give a worst case estimate of how many tuple reads from disk
and tuple writes to disk are necessary for each of the algorithms
RF-Read, RF-Write, and RF-Hybrid-Simple.
- The techniques for agglomerative clustering that Dunham describes
in your textbook are different from that the one we covered in class,
in that Dunham's algorithms use a threshold. Read section 5.4.1, and
discuss how how the threshold differs from our approach. Specifically,
give an example of a dataset and a threshold where Dunham's approach
would differ from ours, assuming that both algorithms use the same
measurements for distances between points and clusters.