A web crawler is a program that starts with a list of URLs (addresses of web pages) and traverses the pages reachable by following links from those starting pages. The best-known web crawlers are Google's Googlebots (and their friends the bingbots, etc.), which collect the data used by Google's search engine.
For this assignment, you will write a small web crawler. This web crawler's target audience will be people who want to collect information about their own web sites, so its specific features focus on that goal.
Your program should be invokable from a Unix command line, like so:
python crawler.py [options] startingURL
This command performs a web crawl starting at the specified URL and produces a report whose contents depend on the options included on the command line.
The available options are:
--linklimit <number> (default value=1000): This option specifies the maximum number of pages crawler.py will retrieve and examine, starting at the given URL, before terminating. For example, with "--linklimit 20", crawler.py will retrieve and examine no more than 20 pages. Note that this option is here both for debugging purposes and to make sure crawler.py doesn't get stuck in an infinite (or might-as-well-be-infinite) loop. If you wish, you could implement a "no limit" option here. For example, "--linklimit infinity".
--searchprefix <prefix>: The idea here is that you will only retrieve and examine pages whose full addresses (without the "http://") begin with this prefix. For example, "--searchprefix cs.carleton.edu/faculty/jondich" would restrict your crawling to pages in that sub-portion of the CS Department's web site. Note that you may retrieve pages outside this prefix (e.g. to make sure the links aren't broken), but you won't examine such pages for other links.
If --searchprefix is not specified, it defaults to the domain portion of the starting URL. For example, if there's no --searchprefix and startingURL is "http://cs.carleton.edu/faculty/jondich/", then the search prefix will default to "cs.carleton.edu".
Note that if the search prefix is not a prefix of the post-http:// portion of startingURL, then the crawler should terminate without retrieving any pages at all.
--action brokenlinks: If this option is specified, crawler.py will print a list of all the broken links found during the crawl. Each broken link will be reported as one line of text consisting of: the URL of the page in which the broken link was found, then a comma, and then the URL that is broken.
--action outgoinglinks: If this option is specified, crawler.py will print a list of all the links it finds that lead outside the search prefix.
--action summary (this is the default if no action is specified): If this option is specified, crawler.py will print out a summary of the crawl. The summary will be formatted as follows:
FilesFound: <number of distinct files retrieved and examined>
LongestPathDepth: <the number of URLs included in the longest path from startingURL>
LongestPath: <a list of the URLs in one longest path, one per line starting on the line after the LongestPath: label>
CantGetHome: <a list of URLs from which you can't get back to the startingURL, one per line after the CantGetHome: label>
You may include any other summary information you wish, but make sure each of these items appears. One possible way to handle the command-line options described above is sketched below.
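For example, here is a minimal sketch of command-line parsing using Python's standard argparse and urllib.parse modules. The function name parse_arguments, and the choice to keep linklimit as a string (so that "infinity" can be accepted), are just one possible design, not requirements.

# A minimal sketch of command-line parsing for crawler.py, using only the
# standard library. Names and defaults here are illustrative; adapt them
# to whatever structure you prefer.
import argparse
from urllib.parse import urlparse

def parse_arguments():
    parser = argparse.ArgumentParser(description='A small web crawler.')
    parser.add_argument('--linklimit', default='1000',
                        help='maximum number of pages to retrieve ("infinity" for no limit)')
    parser.add_argument('--searchprefix', default=None,
                        help='only examine pages whose address (minus "http://") starts with this prefix')
    parser.add_argument('--action', default='summary',
                        choices=['brokenlinks', 'outgoinglinks', 'summary'],
                        help='which report to produce')
    parser.add_argument('startingURL', help='the URL at which to start crawling')
    arguments = parser.parse_args()

    # Default the search prefix to the domain portion of the starting URL.
    if arguments.searchprefix is None:
        arguments.searchprefix = urlparse(arguments.startingURL).netloc

    # If the starting URL does not itself fall under the search prefix,
    # there is nothing to crawl, so stop before retrieving any pages.
    start_without_scheme = arguments.startingURL.split('://', 1)[-1]
    if not start_without_scheme.startswith(arguments.searchprefix):
        parser.error('startingURL does not begin with the search prefix')

    return arguments

if __name__ == '__main__':
    arguments = parse_arguments()
    print(arguments)   # for early testing: just show which options were selected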
You may work alone or with one or two partners for this assignment.
How will you test this? You may certainly use the course home page, which has relatively few links, as a starting place. Be careful, though, to avoid overwhelming the CS server in case of bugs in your code that lead to infinite loops.
It is generally polite to introduce a short delay, say half a second, between requests to a web server (import time, then time.sleep(.5)). You don't want to be the creator of an unintentional denial-of-service attack.
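For example, you might route every retrieval through a small helper that sleeps before fetching and reports the HTTP status; that status is also handy for the brokenlinks report. This sketch uses only urllib from the standard library, and the helper's name and return convention are illustrative, not required.

# A polite page-fetching helper: waits briefly before each request, and
# returns (status_code, page_text).
import time
import urllib.request
import urllib.error

def polite_fetch(url, delay=0.5):
    time.sleep(delay)                      # be kind to the server
    try:
        with urllib.request.urlopen(url) as response:
            return response.status, response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as error:
        return error.code, ''              # e.g. 404 suggests a broken link
    except urllib.error.URLError:
        return None, ''                    # couldn't reach the server at all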
This could be a moderately complex program, so don't write it all and then try to fix everything at once. Make an incremental development plan. Small step forward, test and debug. Another small step forward, test and debug. For example, phase 1 could be "parse the command line and print out all the selected options." Then phase 2 could be "retrieve the starting page and print all its links." Phase 3: "retrieve all the pages linked from the starting page and report those pages' sizes or HTTP error codes or something." etc. Break it down into testable baby steps, and test carefully at each step.
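As an illustration of what phase 2 might look like, here is a sketch that pulls the link targets out of one page's HTML using the standard library's html.parser module; the class and function names are illustrative. Combined with a fetch helper like the one above, extract_links(url, text) lists every page reachable one link away from the starting page.

# A sketch of "phase 2": collect the link targets from one page's HTML.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attributes):
        if tag == 'a':
            for name, value in attributes:
                if name == 'href' and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(page_url, page_text):
    collector = LinkCollector(page_url)
    collector.feed(page_text)
    return collector.links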
Have fun!
Questions? Let me know.