CS 127: Data Structures

Assignment 7: Finding Most Relevant Web Pages

Assigned on Monday, 10/25.
Due on Monday, 11/1 at 5 pm.

Overview

In this assignment, we will implement the part of a web search engine that ranks pages based on how well they match a search query. The best match is the web page with the highest word frequency counts for the words in a particular query. Your main class for this assignment should be called ProcessQueries, and will be used as follows:

java ProcessQueries urlListFile ignoreFile

The urlListFile should contain a list of URLs, one per line. These URLs should correspond to files that are on our file system. An example urlListFile might contain:

/Accounts/courses/cs127/dmusican/dept-web-site/contact.html
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html
/Accounts/courses/cs127/dmusican/dept-web-site/guideToTheCSMajor.html
/Accounts/courses/cs127/dmusican/dept-web-site/index2.html
/Accounts/courses/cs127/dmusican/dept-web-site/stufflist.html
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html
/Accounts/courses/cs127/dmusican/dept-web-site/websearch.html

The ignoreFile should contain a list of words that you would like to ignore when counting word frquencies (just as in the last assignment).

Create a class called URLContent that will store two things: a String containing a URL, and a WordFrequencyTree that will contain the frequency counts for that URL. If you had trouble getting your last assignment working, you can use a HashMap instead; see me immediately so that I can help get you started.

When your program starts, it should process all the URLs in the list by creating a URLContent object for each, and build a WordFrequencyTree for each. (Keep all of these URLContent objects in an ArrayList.) Once you have finished processing all of these URLs, your program should enter a loop as shown below. The program should prompt the user to enter a search query (or -1 to quit), and then list all URLs that match the query in order of the best match first and the worst match last. Include each result URLs priority in parentheses after each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list. The priority of a page is defined as the number of times a word in the query appears on that page. Here is a sample run (not necessarily correct):

Enter a query or -1 to quit.

Search for: computer science
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = x)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = y)

Search for: mtie
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = x)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = y)
/Accounts/courses/cs127/dmusican/dept-web-site/contact.html (priority = z)

Search for: -1

To process a query, create a HeapPriorityQueue. For each URL, determine its priority, then add that URL to the priority queue. After you have processed all URLs, use the priority queue to print out the URLs in order with the priorities next to them.

Getting Started: create a priority queue

To get started create a class called HeapPriorityQueue. This should be a priority queue, implemented as a heap, with the usual methods for inserting and removing items. Test out your priority queue on some sample data. You will be using your priority queue to store Strings. There is priority queue code in your textbook which you can use as a reference (make sure to cite it if you do), but you will undoubtedly learn more and have more fun if you try to build it yourself.

Building the query interface

Next, build the class ProcessQueries. (This is where your main program will lie.) In the main method, use Scanner to read in both the urlListFile and the ignoreFile in a similar manner to the last assignment.

For each URL that you read in, process the actual HTML file and determine the word frequencies. Store that information as a URLContents object.

Running the query

Finally, actually implement the query itself. Determine the priority for each URL, and insert it into a priority queue. Then print out the matching URLs in order of best to worst. Your program should handle multiple word queries. The priority for a given URL should be determined as the sum of the number of matches for each word in the query.

Have fun, and good luck! Remember to start early and make incremental progress.

Many thanks to Tia Newhall and Lisa Meeden at Swarthmore College.