Data Mining Take Home Exam #2

Assigned on Friday, May 30.
Due in class on Wednesday, June 4.

1. The second of the "Data Mining Desiderata" in the ScaleKM paper states that at any time during the evaluation of a data mining algorithm:
What aspects of these three items of information are supported by ScaleKM, CURE, Apriori, and PageRank?

2. Measuring the success of clustering algorithms is typically more difficult than measuring the success of classification algorithms. Why? Describe two methods used in the ScaleKM paper that are used to measure clustering success. For each method, describe a situation where it may not provide desired results.

3. Answer the following questions regarding the role of the alpha parameter in the CURE algorithm.
4. Section 3.3 of the CURE paper derives the time complexity for the algorithm. The authors treat c as a fixed number, and so it does not enter into the results. Re-work this calculation treating c as a variable. Interpret your result by describing how the number of representative points affects the complexity of the algorithm.

5. Of the following statements, which are true and which are false?  For those that are false, give a specific example with a dataset showing that they are false. For those that are true, prove it. Assume that all statements below should be read in the context of particular support and confidence thresholds. For example, the first rule below really says: If supp(X=>Y) >= s and conf(X=>Y) >= c, then supp(Y => X) >= s and conf(Y=>X) >= c.

a) If X => Y, then Y => X
b) If X => Y and Y => Z, then X => Z
c) If X => Z, then XY => Z
d) If XY => Z, then X=> Z

6. Find all frequent itemsets with minimum support of 50% using the Apriori algorithm on the following database:

TID
Items
1
K, A, D, B
2
D, A, C, E, B
3
C, A, B, E
4
B, A, D

7. Suppose that you have a web site with a number of pages that all link to each other so that you have no dangling links. You wish to improve your PageRank. Consider the following strategies:
For each of these strategies, explain whether it will help your PageRank or hurt your PageRank. Also describe how markedly these strategies change your PageRank relative to each other. Your answer should be based on the details of the technique, and not on how likely you are to actually influence other people to change their web pages on your demand. Assume you have global control of the web.