Machine Learning and Data Mining: Exam 2

  1. (15 points) Which of the following statements are true and which are false? For each false statement, give a specific counterexample dataset; for each true statement, prove it. Read each statement in the context of particular support and confidence thresholds. For example, the first rule below really says: if supp(X => Y) >= s and conf(X => Y) >= c, then supp(Y => X) >= s and conf(Y => X) >= c. (A minimal sketch of the support and confidence definitions appears after part (d).)

    a) If X => Y, then Y => X
    b) If X => Y and Y => Z, then X => Z
    c) If X => Z, then XY => Z
    d) If XY => Z, then X=> Z
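
    Note: here is a minimal Python sketch of the support and confidence definitions assumed above; the transactions are an invented toy example, not exam data:

      # Minimal sketch of supp and conf on an invented transaction database.
      transactions = [{"X", "Y"}, {"X", "Y", "Z"}, {"X"}, {"Y", "Z"}]

      def supp(itemset, db):
          """Fraction of transactions containing every item in itemset."""
          return sum(itemset <= t for t in db) / len(db)

      def conf(lhs, rhs, db):
          """conf(lhs => rhs) = supp(lhs union rhs) / supp(lhs)."""
          return supp(lhs | rhs, db) / supp(lhs, db)

      print(supp({"X"}, transactions))         # 3/4
      print(conf({"X"}, {"Y"}, transactions))  # (2/4) / (3/4) = 2/3
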
  2. (15 points) Find all frequent itemsets with a minimum support of 50% using the Apriori algorithm on the following database (a sketch of the Apriori mechanics follows the table):
    TID    Items
    1      K, A, D, B
    2      D, A, C, E, B
    3      C, A, B, E
    4      B, A, D
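
    Note: as a reference for the mechanics only, here is a minimal Apriori sketch in Python (the standard level-wise join and prune steps; the database is transcribed from the table above):

      from itertools import combinations

      # Database transcribed from the table above; min_sup = 50%.
      db = [{"K", "A", "D", "B"},
            {"D", "A", "C", "E", "B"},
            {"C", "A", "B", "E"},
            {"B", "A", "D"}]
      min_sup = 0.5

      def support(itemset):
          return sum(itemset <= t for t in db) / len(db)

      # L1: frequent 1-itemsets.
      items = sorted(set().union(*db))
      level = [frozenset([i]) for i in items if support({i}) >= min_sup]
      frequent = list(level)

      k = 2
      while level:
          # Join: unions of frequent (k-1)-itemsets that form k-itemsets.
          candidates = {a | b for a in level for b in level if len(a | b) == k}
          # Prune: every (k-1)-subset of a candidate must itself be frequent.
          candidates = {c for c in candidates
                        if all(frozenset(s) in level for s in combinations(c, k - 1))}
          level = [c for c in candidates if support(c) >= min_sup]
          frequent.extend(level)
          k += 1

      for s in sorted(frequent, key=len):
          print(sorted(s), support(s))
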
  3. (15 points) Suppose the World Wide Web consisted of four web pages labeled W, X, Y, and Z. Suppose that W links to X and Y, X links to W and Y, Y links to X and Z, and Z links to X and Y. Determine the PageRank for each of these four pages, assuming that the PageRank "E" value is set to 0. Show on paper three iterations of the PageRank algorithm (i.e., using the power method), assuming that you start with an initial PageRank of 1 for all four pages. You may use computational tools (Octave, Mathematica, your favorite spreadsheet, etc.) or code that you have previously written, but the work that you turn in should show clearly that you understand the algorithm (don't just show a bunch of numbers printed out by some program).
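
    Note: here is a minimal power-method sketch for this graph; with E = 0, each update reduces to PR(p) = the sum, over pages q linking to p, of PR(q)/outdeg(q):

      # Link graph from the question: page -> list of pages it links to.
      links = {"W": ["X", "Y"], "X": ["W", "Y"], "Y": ["X", "Z"], "Z": ["X", "Y"]}
      pr = {p: 1.0 for p in links}           # initial PageRank of 1 for all pages

      for it in range(3):                    # three iterations, as asked
          new = {p: 0.0 for p in links}
          for q, outs in links.items():
              for p in outs:
                  new[p] += pr[q] / len(outs)   # q donates PR(q)/outdeg(q) to p
          pr = new
          print(it + 1, pr)
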
  4. (a) (8 points) Describe how HITS can find a webpage that is authoritative on your search query even though the page does not actually contain any of the words from your query.

    (b) (8 points) For the heuristics mentioned on pages 7-8 of the Chakrabarti et al. paper (HITS), explain which of the three problems mentioned at the top of page 7 each heuristic addresses and how it helps to resolve that problem. Indicate whether each heuristic would also be useful for PageRank.
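
    Note: here is a minimal sketch of the HITS hub/authority updates on an invented base-set graph (the pages p1-p4 and their edges are hypothetical); it illustrates that authority scores come entirely from link structure, never from page text:

      import math

      # Hypothetical base-set graph: (source, target) edges.
      edges = [("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4")]
      nodes = {n for e in edges for n in e}
      hub = {n: 1.0 for n in nodes}
      auth = {n: 1.0 for n in nodes}

      for _ in range(20):
          # Authority update: a(p) = sum of hub scores of pages linking to p.
          auth = {n: sum(hub[q] for q, p in edges if p == n) for n in nodes}
          # Hub update: h(p) = sum of authority scores of pages p links to.
          hub = {n: sum(auth[p] for q, p in edges if q == n) for n in nodes}
          # Normalize so the scores converge instead of growing without bound.
          na = math.sqrt(sum(v * v for v in auth.values()))
          nh = math.sqrt(sum(v * v for v in hub.values()))
          auth = {n: v / na for n, v in auth.items()}
          hub = {n: v / nh for n, v in hub.items()}

      print(auth)   # p3 and p4 rank as authorities purely because of in-links
      print(hub)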

  5. (15 points) Suppose you are given the following ratings by Carleton students on four different courses, where a ? indicates that no rating was given:

    Student ID   course 1   course 2   course 3   course 4
    1            3          ?          1          2
    2            1          2          ?          3
    3            3          3          1          5
    4            ?          1          3          5

    Using the correlation technique described in the Breese et al. paper (collaborative filtering), what would be the predicted vote for student 4 on course 2? What would the predicted vote be using the vector similarity method? Assume that none of the extensions described in section 2.2 of the paper are used.
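
    Note: here is a minimal Python sketch of both prediction methods, following my reading of the memory-based formulas in Breese et al. (verify the details against the paper); the ratings are transcribed from the table above, with missing votes simply omitted:

      import math

      # votes[student][course] = rating; missing votes are absent.
      votes = {1: {1: 3, 3: 1, 4: 2},
               2: {1: 1, 2: 2, 4: 3},
               3: {1: 3, 2: 3, 3: 1, 4: 5},
               4: {2: 1, 3: 3, 4: 5}}

      def mean(u):
          return sum(votes[u].values()) / len(votes[u])

      def pearson(a, i):
          """Correlation weight over the items both users rated."""
          common = votes[a].keys() & votes[i].keys()
          num = sum((votes[a][j] - mean(a)) * (votes[i][j] - mean(i)) for j in common)
          da = sum((votes[a][j] - mean(a)) ** 2 for j in common)
          di = sum((votes[i][j] - mean(i)) ** 2 for j in common)
          return num / math.sqrt(da * di) if da and di else 0.0

      def cosine(a, i):
          """Vector-similarity weight; denominators use all of each user's votes."""
          common = votes[a].keys() & votes[i].keys()
          num = sum(votes[a][j] * votes[i][j] for j in common)
          da = sum(v * v for v in votes[a].values())
          di = sum(v * v for v in votes[i].values())
          return num / math.sqrt(da * di)

      def predict(a, item, weight):
          """p(a, item) = mean(a) + kappa * sum_i w(a, i) * (v(i, item) - mean(i))."""
          others = [i for i in votes if i != a and item in votes[i]]
          ws = {i: weight(a, i) for i in others}
          kappa = 1.0 / sum(abs(w) for w in ws.values())
          return mean(a) + kappa * sum(ws[i] * (votes[i][item] - mean(i)) for i in others)

      print(predict(4, 2, pearson))   # correlation-based prediction
      print(predict(4, 2, cosine))    # vector-similarity prediction
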

  6. (12 points) In the collaborative filtering paper, it is suggested that Inverse User Frequency is a good idea: if more people have voted on an item, that item should be considered less important. Bayesian Networks, on the other hand, rely on probability estimates that become more accurate as you increase the number of people used in the probability calculations; they exploit the fact that lots of people have an opinion about a particular object to generate responses for unseen objects. The experimental results indicate that both Inverse User Frequency and Bayesian Networks are good ideas. Resolve this apparent contradiction.
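
    Note: here is a minimal sketch of the Inverse User Frequency transform as I read it from Breese et al. (assuming the weight f_j = log(n / n_j), where n is the total number of users and n_j is the number of users who voted on item j); the ratings reuse the table from question 5:

      import math

      votes = {1: {1: 3, 3: 1, 4: 2}, 2: {1: 1, 2: 2, 4: 3},
               3: {1: 3, 2: 3, 3: 1, 4: 5}, 4: {2: 1, 3: 3, 4: 5}}

      n = len(votes)
      n_j = {}                                  # how many users voted on each item
      for user_votes in votes.values():
          for j in user_votes:
              n_j[j] = n_j.get(j, 0) + 1

      iuf = {j: math.log(n / c) for j, c in n_j.items()}
      print(iuf)   # an item everyone voted on gets weight log(1) = 0
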
  7. (12 points) In the "Data Mining: Good, Bad, or Just a Tool" panel discussion that you saw, Raghu Ramakrishnan (the moderator) makes the statement that "SQL [language for querying databases] is sufficiently evil." What relevance does this statement have to the topic of the panel? Is the statement accurate?