Data Mining Take Home Exam #2

Assigned on Friday, May 30.
Due in class on Wednesday, June 4.

1. The second of the "Data Mining Desiderata" in the ScaleKM paper states that at any time during the evaluation of a data mining algorithm:

a 'best' answer is always available
status information on progress can be provided
expected remaining time is known

What aspects of these three items of information are supported by ScaleKM, CURE, Apriori, and PageRank?

2. Measuring the success of clustering algorithms is typically more difficult than measuring the success of classification algorithms. Why? Describe two methods used in the ScaleKM paper that are used to measure clustering success. For each method, describe a situation where it may not provide desired results.

3. Answer the following questions regarding the role of the alpha parameter in the CURE algorithm.

What does alpha control?
When alpha = 0, what basic clustering algorithm that we discussed in class does CURE imitate? (hint: BIRCH or MST aren't the answers)
When alpha = 1, what basic clustering algorithm that we discussed in class does CURE imitate? (hint: BIRCH or MST aren't the answers)

4. Section 3.3 of the CURE paper derives the time complexity for the algorithm. The authors treat c as a fixed number, and so it does not enter into the results. Re-work this calculation treating c as a variable. Interpret your result by describing how the number of representative points affects the complexity of the algorithm.

5. Of the following statements, which are true and which are false? For those that are false, give a specific example with a dataset showing that they are false. For those that are true, prove it. Assume that all statements below should be read in the context of particular support and confidence thresholds. For example, the first rule below really says: If supp(X=>Y) >= s and conf(X=>Y) >= c, then supp(Y => X) >= s and conf(Y=>X) >= c.

a) If X => Y, then Y => X
b) If X => Y and Y => Z, then X => Z
c) If X => Z, then XY => Z
d) If XY => Z, then X=> Z

6. Find all frequent itemsets with minimum support of 50% using the Apriori algorithm on the following database:

TID	Items
1	K, A, D, B
2	D, A, C, E, B
3	C, A, B, E
4	B, A, D

7. Suppose that you have a web site with a number of pages that all link to each other so that you have no dangling links. You wish to improve your PageRank. Consider the following strategies:

Add links from your pages to high PageRank pages out on the web.
Remove links from your pages to high PageRank pages out on the web.
Add links from your pages to low PageRank pages out on the web.
Remove links from your pages to low PageRank pages out on the web.
Force webmasters of low PageRank pages out on the web to link to your pages.
Force webmasters of high PageRank pages out on the web to link to your pages.
Force webmasters of low PageRank pages out on the web to remove links to your pages.
Force webmasters of high PageRank pages out on the web to remove links to your pages.

For each of these strategies, explain whether it will help your PageRank or hurt your PageRank. Also describe how markedly these strategies change your PageRank relative to each other. Your answer should be based on the details of the technique, and not on how likely you are to actually influence other people to change their web pages on your demand. Assume you have global control of the web.