CS 127: Data Structures

Caching Your Search Query Results

Note to self: This assignment could be improved by better isolating the priority queue from the rest, and providing (perhaps) interfaces for the design as a whole to keep the code from getting too cluttered.

Overview

For this assignment, you will add a cache of query results to your search engine. This speeds up the execution of queries by using the cached results of previous queries. You will also integrate your search engine into your web browser.

Caching

Below is a demonstration of how your more efficient query processing program should behave. You will save all previous results in a data structure called the cache. When the user types in a query, you will first check the cache to see if you have processed a similar request in the past.

Enter a query or -1 to quit.

Search for: computer
0 words from query found in cache
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = x)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = y)
/Accounts/courses/cs127/dmusican/dept-web-site/contact.html (priority = z)

Search for: computer science
1 word from query found in cache
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = a)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = b)
/Accounts/courses/cs127/dmusican/dept-web-site/contact.html (priority = c)

Search for: science
1 word from query found in cache
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = m)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = n)

Search for: -1

Notice that the first time the user asks about "computer" we cannot find anything in the cache. But the next time the user asks about "computer science" we can grab the stored information about "computer" and perform a search of WordFrequencyTrees for just the "science" part. The priority for each page is then determined as the sum of the priorities for the words "computer" and "science." When the query "science" is entered, its result simply can be obtained from the stored informatation in the cache. This approach yields the added advantage that we want the query results to be the same regardless of the word order, so "computer science" is the same as "science computer".

The cache should be represented as a hash map of your own construction where the key is one of the words in your query. The data stored in the hash map along with the key should be the count of the number of times the key appears in each web page. You should not use any of the built-in Set or Map classes for this assignment.

The best place to implement caching is in your ProcessQueries class.

Integrating your search engine with your browser

Completion of the caching described above, if done correctly, is sufficient for full credit on this assignment. However, you may find the assignment much more exciting if you can integrate the results with your Minibrowse code. Here is how to do so:

  1. Modify your Minibrowse code so that the URL list and the ignore list are input as command-line arguments. This information should then be passed off to ProcessQueries as necessary.

  2. Add a text box to Minibrowse where the user will type in a query, and a "Search" button to trigger the search.

  3. In Minibrowse, construct a ProcessQueries object (if necessary) and run the appropriate methods to generate search results.

  4. Modify ProcessQueries so that instead of dumping the query results to the screen, it returns a string with the query results.

  5. In Minibrowse, instead of using the setPage method of JEditorPane for showing a web page, use the setText method which takes an HTML string as a parameter. Your string will thus have to be in HTML format.

Have fun, and good luck!


Many thanks to Tia Newhall and Lisa Meeden at Swarthmore College.