Machine Learning and Data Mining Assignment: PageRank

For this assignment, you will determine the PageRank for a series of web pages. Though it would be lots of fun to actually write a crawler and an HTML parser, I'll make your life significantly easier (and remove the potential for crashing campus web servers). Instead, you'll use a dataset I generated a while back for exactly this kind of purpose.

In the course directory, there is a file named webpages.txt that contains a text version of an archive of the department website. I generated it a few years ago with a script that used lynx to render the web pages as text, followed by some post-processing of the output. The text _WEBPAGE_ marks the beginning of a new web page and is followed by that page's URL; the text of the page comes next. At the bottom of each web page that has links is a section titled References, which lists the links from that page. From this information you should be able to construct the link structure of the department website and compute the PageRank. Treat links to external sites as dangling links, as described in the PageRank paper, and do the same for internal links that point to pages with no outgoing links.
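A parser for this format can be fairly small. The sketch below assumes _WEBPAGE_ appears on a line by itself with the URL on the following line, and that reference lines may be numbered with the URL last on the line; the real file's layout may differ slightly, so adjust the conditions to match what you actually see in webpages.txt.

```python
from collections import defaultdict

def parse_link_graph(path):
    """Parse webpages.txt into a dict mapping each page URL to its out-links.

    Assumptions (verify against the actual file): _WEBPAGE_ sits on its own
    line, the URL is on the next line, and links follow a line reading
    'References', possibly numbered like '1. http://...'.
    """
    graph = defaultdict(list)
    current = None
    in_refs = False
    expect_url = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line == "_WEBPAGE_":
                expect_url = True
                in_refs = False
            elif expect_url:
                current = line
                graph[current]  # register the page even if it has no links
                expect_url = False
            elif line == "References":
                in_refs = True
            elif in_refs and line and current is not None:
                # keep only the URL portion of a possibly numbered line
                graph[current].append(line.split()[-1])
    return dict(graph)
```

Pages whose URLs never appear as a key in the resulting dict are external, which gives you an easy test for the dangling-link rule above.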

Make sure to be smart and store your data in some kind of sparse format (or you'll almost certainly run out of memory). You will likely want to use some kind of hash table to store your data.
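A hash table keyed by URL doubles as the sparse link matrix, and it also makes dangling-link elimination straightforward. The sketch below is one plausible reading of the iterative procedure from the PageRank paper, not necessarily the exact variant the paper intends: repeatedly drop pages with no surviving out-links (external targets count as dangling too), since removing one dangling page can create another.

```python
def eliminate_dangling(graph):
    """Iteratively remove dangling pages and the links pointing to them.

    `graph` maps URL -> list of out-link URLs. Links to URLs absent from the
    dict (external sites) are treated as dangling. Returns the pruned graph
    and the number of passes until the graph stopped changing.
    """
    graph = {u: list(vs) for u, vs in graph.items()}
    passes = 0
    while True:
        dangling = {u for u, vs in graph.items() if not vs}
        pruned = {u: [v for v in vs if v in graph and v not in dangling]
                  for u, vs in graph.items() if u not in dangling}
        passes += 1
        if pruned == graph:
            return graph, passes
        graph = pruned
```

The pass count is one of the numbers the write-up asks you to report. After computing ranks on the pruned graph, the paper's approach adds the dangling pages back for the final ranking.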

You'll need to choose a value for E. Does your choice of E affect the results you get? How? Ultimately, pick a value of E that seems appropriate, and turn in on paper a list of the 50 highest-ranked pages with their ranks. Also explain how the rankings varied with E and how you chose a value. Provide an analysis of the results: are they reasonable? Are there any surprises? How many iterations did it take for PageRank to stabilize? How many iterations of "dangling link elimination" did you need to do? Finally, submit your code via hsp.
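As a reference point for experimenting with E, here is a power-iteration sketch that treats E as a uniform source vector with total weight e (the common damping formulation; the paper also allows non-uniform E). It assumes dangling links have already been eliminated, so every page has at least one out-link; the tolerance and iteration cap are arbitrary choices you should tune.

```python
def pagerank(graph, e=0.15, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on a link graph (URL -> list of out-links).

    E is modeled as a uniform vector of total weight `e`. Returns the rank
    dict and the number of iterations until the L1 change fell below `tol`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}
    for it in range(1, max_iter + 1):
        new = {u: e / n for u in pages}
        for u, links in graph.items():
            share = (1.0 - e) * rank[u] / len(links)
            for v in links:
                new[v] += share  # distribute u's rank over its out-links
        delta = sum(abs(new[u] - rank[u]) for u in pages)
        rank = new
        if delta < tol:
            return rank, it
    return rank, max_iter
```

Sorting the result, e.g. sorted(rank.items(), key=lambda kv: -kv[1])[:50], gives the top-50 list to turn in, and the returned iteration count answers the stabilization question; rerunning with several values of e shows how sensitive the ordering is to your choice.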