CS 127, Data Structures

Final Project: "COWS", the Carleton Original Web Search engine

Assigned on Friday, 11/15.
Program due electronically by 5:00 PM on Monday, 11/25 (no extensions by College policy!)

Overview

For this assignment, you will create a web search engine. Google's days are numbered!

Specifically, you will process the file /Accounts/courses/cs127/dmusican/webpages.txt, which contains copies of many of the web pages from our department's web site. For each web page, you will create a node in a hash table that contains the name of the web page, the number of times each word appears in that web page, and the number of web pages that link to that web page.

Once this structure is complete, you will allow a user to enter a keyword to search for. Your search engine will look through its data and print out the URLs for the top ten web pages that satisfy that query, in descending order based on the number of times the keyword appears in that document. You will also print out the number of pages that link to each of your documents.

Part 1: Create the class BinarySearchTree

Create a class called BinarySearchTree. Each node in this tree should contain a string called keyword and an integer called count. You should provide the following methods:

constructor
destructor
int wordcount(string word) - this method searches for a particular word in the tree. If it finds it, it returns the count associated with it. If it does not find the word, it returns 0.
void increment(string word) - this method searches for a particular word in the tree, and adds one to the number of times the word appears. If the word is currently not in the tree, it should be added with a count of 1.

Debug and test this class befre moving on.

Part 2: Create the class WebPageContent

This part is easy: just make a wrapper class to hold a string with a URL, an integer to contain the number of pages that link to this page, and a BinarySearchTree.

Part 3: Collect word counts for all the web pages, and the page rank

The file /Accounts/courses/cs127/dmusican/webpages.txt is a single file that contains ascii versions of a number of web pages from our department's web site. The HTML for each new web page begins wherever you see the text "_WEBPAGE_", which is then followed by a URL. For example, here are the first few lines of webdata.txt:

_WEBPAGE_ http://www.mathcs.carleton.edu/crew/progress.htm


                                 Progress


   Our results as of the end of the 2001-2002 schoolyear are summarized
   in the poster which we presented at the [1]Grace Hopper Celebration of
   Women in Computing 2002 Conference, which we attended in Vancouver,
   BC.

Write a function that returns a hash table containing WebPageContent objects, indexed by the name of the web page. Each entry of the hash table should be a WebPageContent object. Here is an outline of how you should proceed:

Scan webpages.txt.
Whenever you find the text _WEBPAGE_, see if an entry in the hash table has been already created for this page. It might have been if another page linked to this one.
Scan through all the text of this web page, incrementing the number of times each word appears in the binary search tree associated with this web page.
If you read a word that begins with http://www.mathcs.carleton.edu, you should consider this a link. Look for an object associated with this web page in your hash table. If you find it, increment its page rank (the number of pages that link to it). If you do not, create a new object for this page in the hash table with a page rank of 1.

Part 4: Run a query

Your program, when complete, should prompt the user to enter a single keyword to search for. Your code should then search through each web page in the hash table, and find those pages that contain the keyword. The URLs of these pages should be displayed to the screen, along with the number of times the word appears, in descending order. You should also simultaneously display the page rank for each of these pages.

Other Details

Your search should be case insensitive. The easiest way to acheive this is probably to convert all words to lower case before you store them.
I will test your code using Insure to make sure that you do not have any memory errors or memory leaks. You should do the same!
Testing your program on the entire webpages.txt will be slow. Copy it and make a small version for yourself so that your program will run quickly. You may want to make sure that you at least capture a few pages that link to each other so that you can test your page rank utility.
Make sure you code your program incrementally. Test your program in small pieces. Get it completely working without page rank, then add it later. Set deadlines for yourself on various pieces of the assignment so that you know whether or not you are falling behind. Here is a recommended schedule (but this is entirely up to you):

Binary Search Tree working and tested by Sunday, 11/17
Hash table with web page counts working and tested by Tuesday, 11/21
Query functionality working and tested by Thursday, 11/22
Page rank functionality working and tested by Saturday. 11/23
Sunday, 11/24: Slack day in case something went wrong