Writing Your Web Crawler

Final Project

Assignment:

Write a crawler that downloads every page discovered from a set of seed URLs. You will run it against a local repository of web pages, so you do not need to worry about network programming or honoring a robots.txt file, but you will have to hard-code the directory separators for your particular OS. You also do not have to worry about dynamic pages that generate links endlessly. Similarly, because there is no Content-Type header, you will have to decide which types of files you do not wish to index based on their file name extensions.
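
As a starting point, here is a minimal sketch of extension-based filtering in Python. The skip list and the should_index name are illustrative only; choose whatever set of extensions covers the files you decide not to index.

    import os

    # Hypothetical skip list; extend or trim it to match the file types you
    # decide not to index (images, stylesheets, scripts, archives, and so on).
    SKIP_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".pdf", ".zip"}

    def should_index(path):
        """Return True when the file name extension suggests an indexable page."""
        _, ext = os.path.splitext(path)
        return ext.lower() not in SKIP_EXTENSIONS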

Your program should fully parse each page's links in order to crawl the site. As each page is crawled, it should be "downloaded" into a separate folder to be handed off to an indexer. You do not need to worry about scalability (a full iteration should not take long to run), but make sure you are crawling as much of this web as you can automatically. It may help to print how many pages were crawled at the end of an iteration; this gives you a good idea of how well your code follows links and where additional seeds are needed. Be careful to avoid following circular links, and make your parsing as robust as possible so it can cope with all of the malformed HTML that exists.
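
One possible shape for the crawl loop, again sketched in Python: a queue of local file paths plus a visited set keeps circular links from causing infinite loops, each page is copied into a hand-off folder for the indexer, and a tolerant HTML parser pulls out href values from messy markup. The directory names, seed path, and class names below are placeholders rather than part of the assignment.

    import os
    import shutil
    from collections import deque
    from html.parser import HTMLParser

    ARCHIVE_ROOT = "archive"   # hypothetical local copy of one of the websites
    OUT_DIR = "downloaded"     # hand-off folder for the indexer

    class LinkExtractor(HTMLParser):
        """Collects href values; HTMLParser tolerates a fair amount of malformed HTML."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed):
        os.makedirs(OUT_DIR, exist_ok=True)
        visited = set()                        # guards against circular links
        queue = deque([os.path.normpath(seed)])
        while queue:
            path = queue.popleft()
            if path in visited or not os.path.isfile(path):
                continue
            visited.add(path)
            shutil.copy(path, OUT_DIR)         # "download" the page for the indexer
            with open(path, encoding="utf-8", errors="replace") as f:
                parser = LinkExtractor()
                parser.feed(f.read())
            base = os.path.dirname(path)
            for href in parser.links:
                href = href.split("#")[0].split("?")[0]   # drop fragments and queries
                if not href or href.startswith(("http://", "https://", "mailto:")):
                    continue                   # stay inside the local archive
                # extension filtering (see the sketch above) is omitted here for brevity
                queue.append(os.path.normpath(os.path.join(base, href)))
        print(f"Crawled {len(visited)} pages this iteration")

    crawl(os.path.join(ARCHIVE_ROOT, "index.html"))

A breadth-first queue with a visited set is the simplest defense against circular links; a recursive depth-first walk works too, but it is easier to overflow the stack and harder to report a running page count. Note also that copying files flat into a single folder will collide when two pages share a file name; mirroring the archive's directory structure in the output folder avoids that.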

Website Archives:
Four websites have been downloaded and archived in advance for you to test your program against; the archives are listed below.

  • Access To Insight: http://www.accesstoinsight.org/tech/download/bulk.html (Jan 25, 2009)
  • Stanford Encyclopedia of Philosophy: http://plato.stanford.edu/ (Jan 26, 2009)
  • Internet Encyclopedia of Philosophy: http://www.iep.utm.edu/ (Feb 09, 2009)
  • Creative Commons: http://www.creativecommons.org/ (Feb 09, 2009)

