CS 257: Software Design

A web crawler

A web crawler is a program that starts with a list of links/URLs/addresses of web pages and traverses the pages reachable from those starting pages by following links. The best-known web crawlers are the Googlebots (and their friends the bingbots, etc.), which Google and its counterparts use to collect the data behind their search engines.

For this assignment, you will write a small web crawler. This web crawler's target audience will be people who want to collect information about their own web sites, so its specific features focus on that goal.

Details

Your program should be invokable from a Unix command line, like so:

python crawler.py [options] startingURL

This command performs a web crawl starting at the specified URL and produces a report whose contents depend on the options given on the command line.
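
To make the expected interface concrete, here is one possible (not required) sketch of the command-line parsing, using Python's argparse module. The flag names match the options described below; the optional "infinity" spelling of --linklimit is treated as "no limit".

import argparse

def parse_command_line():
    """Parse the crawler's command-line arguments (one possible approach)."""
    parser = argparse.ArgumentParser(description="A small web crawler.")
    parser.add_argument("--linklimit", default="1000",
                        help="maximum number of pages to retrieve and examine, or 'infinity'")
    parser.add_argument("--searchprefix", default=None,
                        help="only examine pages whose post-http:// addresses begin with this prefix")
    parser.add_argument("--action", default="summary",
                        choices=["brokenlinks", "outgoinglinks", "summary"],
                        help="which report to produce")
    parser.add_argument("startingURL", help="the URL at which the crawl begins")
    args = parser.parse_args()
    # Support the optional "no limit" spelling described under --linklimit.
    args.linklimit = float("inf") if args.linklimit == "infinity" else int(args.linklimit)
    return args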

The available options are:

--linklimit <number> (default value=1000): This option specifies the maximum number of pages crawler.py will retrieve and examine, starting at the given URL, before terminating. For example, with "--linklimit 20", crawler.py will retrieve and examine no more than 20 pages. Note that this option exists both for debugging purposes and to make sure crawler.py doesn't get stuck in an infinite (or might-as-well-be-infinite) loop. If you wish, you may also implement a "no limit" option. For example, "--linklimit infinity".

--searchprefix <prefix>: The idea here is that crawler.py will examine (that is, search for links in) only those pages whose full addresses (without the "http://") begin with this prefix. For example, "--searchprefix cs.carleton.edu/faculty/jondich" would restrict your crawling to pages in that sub-portion of the CS Department's web site. Note that you may still retrieve pages outside this prefix (e.g. to make sure links to them aren't broken), but you won't examine such pages for other links.

If --searchprefix is not specified, it defaults to the domain portion of the starting URL. For example, if there's no --searchprefix and startingURL is "http://cs.carleton.edu/faculty/jondich/", then the search prefix will default to "cs.carleton.edu".

Note that if the search prefix is not a prefix of the post-http:// portion of startingURL, then the crawler should terminate without retrieving any pages at all.
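
For example, the default prefix and the prefix check might be handled like this (a sketch only, with illustrative function names; it uses urllib.parse and ignores query strings and fragments):

from urllib.parse import urlparse
import sys

def strip_scheme(url):
    """Return a URL without its scheme, e.g. 'http://a.edu/b' -> 'a.edu/b'."""
    parsed = urlparse(url)
    return parsed.netloc + parsed.path

def in_search_space(url, search_prefix):
    """True if the post-http:// portion of url begins with the search prefix."""
    return strip_scheme(url).startswith(search_prefix)

def resolve_search_prefix(starting_url, search_prefix):
    """Default the prefix to startingURL's domain, and enforce the rule that
    startingURL itself must lie inside the search prefix."""
    if search_prefix is None:
        search_prefix = urlparse(starting_url).netloc    # e.g. "cs.carleton.edu"
    if not in_search_space(starting_url, search_prefix):
        sys.exit(0)    # startingURL is outside the prefix: retrieve nothing
    return search_prefix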

--action brokenlinks: If this option is specified, crawler.py will print a list of all the broken links found during the crawl. Each broken link will be reported as one line of text consisting of: the URL of the page in which the broken link was found, then a comma, and then the URL that is broken.
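
One way to detect broken links (a sketch, assuming a "broken" link means one whose retrieval fails with a network error, a malformed address, or an HTTP error status) and print them in this format:

import urllib.request
import urllib.error

def is_broken(url, timeout=10):
    """Return True if retrieving url fails (HTTP error, network error, bad URL)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return False
    except (urllib.error.URLError, ValueError):
        return True

def report_broken_link(source_url, broken_url):
    """Print one broken link: the page it was found on, a comma, the broken URL."""
    print(f"{source_url},{broken_url}")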

--action outgoinglinks: If this option is specified, crawler.py will print a list of all the links it finds that go outside the search space (that is, links whose post-http:// addresses do not begin with the search prefix).

--action summary (this is the default if no action is specified): If this option is specified, crawler.py will print out a summary of the crawl. The summary will be formatted as follows:

FilesFound: <number of distinct files retrieved and examined>
LongestPathDepth: <the number of URLs included in the longest path from startingURL>
LongestPath: <a list of the URLs in one longest path, one per line starting on the line after the LongestPath: label>
CantGetHome: <a list of URLs from which you can't get back to the startingURL, one per line after the CantGetHome: label>

You may include any other summary information you wish, but be sure to include each of the items above.
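
As a rough sketch of how these statistics might be computed once the crawl has built a link graph (here, a dictionary mapping each examined URL to the list of in-space URLs it links to), and assuming "longest path" is interpreted as the deepest breadth-first path from startingURL (one reasonable reading; yours may differ):

from collections import deque

def print_summary(graph, start):
    """Print the summary report from a crawl graph.

    graph: dict mapping each retrieved-and-examined URL to the list of
    in-space URLs it links to. 'Longest path' is taken here to mean the
    longest of the shortest (breadth-first) paths from start, which sidesteps
    the general longest-path problem on graphs with cycles.
    """
    # Breadth-first search from start, remembering each page's predecessor.
    parent = {start: None}
    queue = deque([start])
    deepest = start
    while queue:
        url = queue.popleft()
        deepest = url                      # last URL dequeued is at maximal BFS depth
        for link in graph.get(url, []):
            if link not in parent:
                parent[link] = url
                queue.append(link)

    # Reconstruct one longest path by walking predecessors back to start.
    path = []
    url = deepest
    while url is not None:
        path.append(url)
        url = parent[url]
    path.reverse()

    # CantGetHome: examined pages from which start is unreachable. Reverse the
    # graph, then find everything that can reach start.
    reverse = {}
    for url, links in graph.items():
        for link in links:
            reverse.setdefault(link, []).append(url)
    can_get_home = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for predecessor in reverse.get(url, []):
            if predecessor not in can_get_home:
                can_get_home.add(predecessor)
                queue.append(predecessor)

    print(f"FilesFound: {len(graph)}")
    print(f"LongestPathDepth: {len(path)}")
    print("LongestPath:")
    for url in path:
        print(url)
    print("CantGetHome:")
    for url in graph:
        if url not in can_get_home:
            print(url)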

Notes

Questions? Let me know.