Data Mining Take Home Exam #2
Assigned on Friday, May 30.
Due in class on Wednesday, June 4.
1. The second of the "Data Mining Desiderata" in the ScaleKM paper
states that at any time during the evaluation of a data mining
algorithm:
- a 'best' answer is always available
- status information on progress can be provided
- expected remaining time is known
What aspects of these three items of information are supported by
ScaleKM, CURE, Apriori, and PageRank?
2. Measuring the success of clustering algorithms is typically more
difficult than measuring the success of classification algorithms.
Why? Describe two methods used in the ScaleKM paper that are used to
measure clustering success. For each method, describe a situation
where it may not provide desired results.
3. Answer the following questions regarding the role of the alpha
parameter in the CURE algorithm.
- What does alpha control?
- When alpha = 0, what basic clustering algorithm that we discussed
in class does CURE imitate? (hint: BIRCH or MST aren't the answers)
- When alpha = 1, what basic clustering algorithm that we discussed
in class does CURE imitate? (hint: BIRCH or MST aren't the answers)
4. Section 3.3 of the CURE paper derives the time complexity for the
algorithm. The authors treat c as a fixed number, and so it
does not enter into the results. Re-work this calculation treating c
as a variable. Interpret your result by describing how the number of
representative points affects the complexity of the algorithm.
5. Of the following statements, which are true and which are
false? For those that are false, give a specific example with a
dataset showing that they are false. For those that are true, prove it.
Assume that all statements below should be read in the context of
particular support and confidence thresholds. For example, the first
rule below really says: If supp(X=>Y) >= s and conf(X=>Y)
>= c, then supp(Y => X) >= s and conf(Y=>X) >= c.
a) If X => Y, then Y => X
b) If X => Y and Y => Z, then X => Z
c) If X => Z, then XY => Z
d) If XY => Z, then X=> Z
6. Find all frequent itemsets with minimum support of 50% using the
Apriori algorithm on the following database:
TID
|
Items
|
1
|
K, A, D, B
|
2
|
D, A, C, E, B
|
3
|
C, A, B, E
|
4
|
B, A, D
|
7. Suppose that you have a web site with a number of pages that all
link to each other so that you have no dangling links. You wish to
improve your PageRank. Consider the following strategies:
- Add links from your pages to high PageRank pages out on the web.
- Remove links from your pages to high PageRank pages out on the
web.
- Add links from your pages to low PageRank pages out on the web.
- Remove links from your pages to low PageRank pages out on the web.
- Force webmasters of low PageRank pages out on the web to link to
your pages.
- Force webmasters of high PageRank pages out on the web to link to
your pages.
- Force webmasters of low PageRank pages out on the web to remove
links to your pages.
- Force webmasters of high PageRank pages out on the web to
remove links to your pages.
For each of these strategies, explain whether it will help your
PageRank or hurt your PageRank. Also describe how markedly these
strategies change your PageRank relative to each other. Your answer
should be based on the details of the technique, and not on how likely
you are to actually influence other people to change their web pages on
your demand. Assume you have global control of the web.