CS 127, Data Structures

Final Project: "COWS", the Carleton Original Web Search engine

cow
Assigned on Friday, 11/15.
Program due electronically by 5:00 PM on Monday, 11/25 (no extensions by College policy!)

Overview

For this assignment, you will create a web search engine. Google's days are numbered!

Specifically, you will process the file /Accounts/courses/cs127/dmusican/webpages.txt, which contains copies of many of the web pages from our department's web site. For each web page, you will create a node in a hash table that contains the name of the web page, the number of times each word appears in that web page, and the number of web pages that link to that web page.

Once this structure is complete, you will allow a user to enter a keyword to search for. Your search engine will look through its data and print out the URLs for the top ten web pages that satisfy that query, in descending order based on the number of times the keyword appears in that document. You will also print out the number of pages that link to each of your documents.

Part 1: Create the class BinarySearchTree

Create a class called BinarySearchTree. Each node in this tree should contain a string called keyword and an integer called count. You should provide the following methods:
Debug and test this class befre moving on.

Part 2: Create the class WebPageContent

This part is easy: just make a wrapper class to hold a string with a URL, an integer to contain the number of pages that link to this page, and a BinarySearchTree.

Part 3: Collect word counts for all the web pages, and the page rank

The file /Accounts/courses/cs127/dmusican/webpages.txt is a single file that contains ascii versions of a number of web pages from our department's web site. The HTML for each new web page begins wherever you see the text "_WEBPAGE_", which is then followed by a URL. For example, here are the first few lines of webdata.txt:

_WEBPAGE_ http://www.mathcs.carleton.edu/crew/progress.htm

                                 Progress

   Our results as of the end of the 2001-2002 schoolyear are summarized
   in the poster which we presented at the [1]Grace Hopper Celebration of
   Women in Computing 2002 Conference, which we attended in Vancouver,
   BC.

Write a function that returns a hash table containing WebPageContent objects, indexed by the name of the web page. Each entry of the hash table should be a WebPageContent object. Here is an outline of how you should proceed:
  1. Scan webpages.txt.
  2. Whenever you find the text _WEBPAGE_, see if an entry in the hash table has been already created for this page. It might have been if another page linked to this one.
  3. Scan through all the text of this web page, incrementing the number of times each word appears in the binary search tree associated with this web page.
  4. If you read a word that begins with http://www.mathcs.carleton.edu, you should consider this a link. Look for an object associated with this web page in your hash table. If you find it, increment its page rank (the number of pages that link to it). If you do not, create a new object for this page in the hash table with a page rank of 1.

Part 4: Run a query

Your program, when complete, should prompt the user to enter a single keyword to search for. Your code should then search through each web page in the hash table, and find those pages that contain the keyword. The URLs of these pages should be displayed to the screen, along with the number of times the word appears, in descending order. You should also simultaneously display the page rank for each of these pages.

Other Details

Good luck, stay on time, and ask for help early!



This project was based on an idea from Tia Newhall and Lisa Meeden at Swarthmore College.