CS 127: Data Structures

Assignment 6: Web Page Word Frequency

Assigned on Wednesday, 10/20.
Due on Monday, 10/25 at 5 pm.

Overview

In this assignment, we begin the process of integrating a search engine into our web browser. The first step to building a search engine is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this assignment you will use a map implemented as a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out all words (in alphabetical order) that occur at least the minimum frequency number of times and print each of these word's frequency count.

WORD       FREQUENCY
----       ---------
artifical  5
cat        3
dog        7
... 

Getting Started

As we have in the past, we will use the Scanner class for reading in files. In particular, we will use the next() method, which grabs one token at a time. A token, in this context, is any string of characters that is not broken up by whitespace or non-alphanumeric characters.

To get started, create a class called WordFrequencyTree. In the main method, use a Scanner to read in a file, one token at a time, and output the tokens to the screen as you read them. You might want to use FileInputDemo.java as a reference, though remember that we will be using next() instead of nextLine(). You should try your program out on an HTML file, since these are the files that describe how web pages are rendered. For some sample web pages, look in /Accounts/courses/cs127/dmusican/dept-web-site. In this directory, we have a copy of an old version of our department website. You should also be able to include the filename as a command-line argument, just as FileInputDemo.java does.

After you have your Scanner code working, you should add the following useDelimiter code to your program:

in = new Scanner(new File(args[0]));
in.useDelimiter("[\\s\\W]+");

This indicates that any characters that are whitespace and not part of words, i.e. punctuation and the like, should be used as delimiters and not actually read.

Crafting your binary search tree

Next, you should implement an add method to add a word to your tree. If the word is not present in the tree, add it with a count of 1. If the word is already in the tree, increment its count. Also add a get method to get the count associated with a particular word. get should return 0 if the word is not in the tree. Finally, add a method named print that prints out all the words (and their frequencies), in alphabetical order, for words that occur at least min_freq_num times.

For this assignment, you should not the built-in Java classes that do this task such as TreeMap, HashMap, TreeSet, or HashSet.

Putting it all together

You will actually use two different binary search trees in your main method. One such tree will be used for storing frequencies of words. The other tree will be used for storing words in an "ignore list." The "ignore list" contains a list of HTML tokens that your program should ignore when it reads your HTML file. For example, every HTML file has a

token. Your search engine will want
to ignore this, since it should focus on actual words in the web
page.

In the end, your main method should take three command-line arguments. The first one is the filename of the HTML file to be read. The second is the filename of the ignore list, which should just be a file containing a list of HTML tokens to ignore. The third argument should be the minumum word frequency to use when printing.

Part of this assignment involves creating an ignore list. Don't spend a lot of time on this, but see if you can figure out some of the most important HTML tokens to get in here.

Have fun, and good luck!


Many thanks to Tia Newhall and Lisa Meeden at Swarthmore College.