Search Engine via Hashing
Table of Contents
This is a pair programming assignment. If you are working in a pair, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Set a timer to swap every 15 minutes. You can choose your favorite from this online timer page. Make sure your sound volume is audible, but not so loud to disturb the people around you.
We will use anonymous grading on Moodle, which means that the grader won't see your name until after the grading is done. This is an easy way to help add an extra element of fairness to the grading. Therefore, make sure your name doesn't appear on your actual submission. When you submit via Moodle, it will know you are. Thanks!
Overview
For this assignment, we'll make a rudimentary search engine. Specifically, we'll load up all of the titles and plots from the Internet Movie Database (IMDb) into a map implemented as a hash table, and then allow the user to request titles and plots for movies as desired.
Part 1: Create a Map
Create a class called HashMap201
that will be used to map Strings to
Strings. (The intention is to use it to map movie titles to movie plots.)
Implement this map as a hash table with chaining via linked lists. For this
assignment, you should not the built-in Java classes that do this task such as
TreeMap
, HashMap
, TreeSet
, or HashSet
. That said, you should use the
built-in class LinkedList
, for your chains. Your map should have the following
methods:
void put(String key, String value)
add a key and a value to your map. If the key is already present in the map, replace the old value with the new one.
String getValue(String key)
get the value associated with a particular key. Return
null
if the key is not in the map.
Make sure to test your map appropriately. A main method would be a nice way to do this.
You should not be searching the web or other sources for similar code to this. You are welcome to look at the textbook.
Optional extra challenge
If you would like an optional additional challenge, make your class generic
instead of String specific. In that case, your HashMap201<K,V>
class methods
would look like:
void put(K key, V value)
V getValue(K key)
Using the built-in LinkedList
classes
To use the built-in LinkedList
class, review the various methods that it
offers. One source of trickiness is that for each entry, you need to store both
a key and a value. You'll have to think through how you want to do that. You'll
also need to address how to efficiently search through the linked list and be
able to change one of the items if you need to without having to find it all
over again. For this purpose, the so-called for-each loop can be quite useful. Here
is a sample as to how you might use it:
// Assume I've created a class called Entry which contains a key and a value LinkedList<Entry> mylist = new ListList<Entry>(); // ... add a bunch of items to it ... for (Entry e : mylist) { System.out.println(e.key); System.out.println(e.value); e.value = someNewValue; }
The for-each loop works on any class that implements the Iterable
interface. Java Interlude 5 in your textbook has lots more information on how
iterators work, which forms the basis for what the for-each loop does.
Part 2: Fill up the Map with data from IMDb, and run queries
About IMDb data
IMDb makes plain text files of all of its data available for download. We're
going to use a file called plotunsorted.list
that contains titles and plot
summaries for all of the movies in the database. Because the file is so large
(138 MB compressed, 404 MB uncompressed), I've had to place them in a shared CS
department directory with enough disk space (that location is called
/Accounts/courses/
). That directory isn't accessible via the web, so you'll
find directions below on how to obtain the file you need. But first: do not copy
this file to any of your network-shared Carleton directories. If 35 students
each copy and uncompress a (400 MB) file to their own network storage directory,
that's a total of 14 gigabtes of network storage that we're wasting merely for
everyone to have their own copy. Moreover, you'll be fighting with all the other
students in the class to read the file over the network every time you read it
in.
Also, in case you were thinking about being clever about saving disk space
(which is often awesome), do not direct the program you write to read the file
directly from the /Accounts/courses
location without copying it first. If 35
students in the class repeatedly try to read a very large file from the same
network storage location at the same time, repeatedly while debugging code, we
may clog the network and bring it to a halt. This is the same problem we face if
you put the file in your home directory (in addition to storage space issues).
So what should you do? If you're working on a lab machine, make a copy of the
file to the directory /tmp
. /tmp
is a local directory on the computer that
you're using. The file might get deleted on a reboot or a logout, but you can
always recopy the file over if it vanishes. If you're using your own computer,
you can put the file anywhere you want on your own computer. Make sure, however,
that you are storing it locally and not in some network location.
(One minor source of irritation you may run into is that if someone who used the
computer before you has already copied this file onto /tmp
, you may not have
permission to read the file. If that happens, just copy it again with a
different file name. Alternatively, reboot the computer, and it will hopefully
disappear.)
There are a variety of ways of copying the file, but here is one that will work
under Mac or Linux: use scp
, which is a command-line program for doing a
"secure copy." At a command-prompt, you can learn more about how it works by
typing:
man scp
Specifically, to grab the file you need, first use the cd
command to change to
the directory you want to put the file in. For example, if it's a lab machine,
type
cd /tmp
Then issue the following command, which securely copies the file from the remote location:
scp username@mirage.mathcs.carleton.edu:/Accounts/courses/cs201/dmusican/plotunsorted.list.gz .
where username
is your username. Don't miss the dot at the very end; that
means copy the file into your current directory. mirage
is the name of a CS
department server that you're copying the file from. You'll need to enter your
Carleton password once you issue the above command, and the file should start
coming down. Once you've got it, you'll need to uncompress it. That file is
compressed with the gz
compression format. To uncompress it, on a Mac or Linux
issue the following command. (Check again first: are you in the directory /tmp
?)
gunzip plotunsorted.list.gz
Windows computers frustratingly don't have scp or a similar program installed.
If you're using Windows, you can install WinSCP, which should do the same thing,
though you'll have to figure out how to configure it; I haven't played with it.
Likewise, to unzip a gz
file under Windows, you'll likely need to install
either Winzip or 7-zip.
Make a smaller version
For testing your program, you should make a small version of this file where you
cut out most of the movies. Reading in the entire plotunsorted.list
file takes a long
time, and if you read from that file every time you debug your program, you'll
add many hours to your coding and debugging waiting for it every time. This is a
key general strategy that is really useful when working with big data; make a
small version of the data to use for testing. Opening up a file this big in a
text editor would likely be exceedingly slow and would use a huge amount of
memory. Instead, try to use the UNIX command head
which works on both Mac and
Linux. Read the man page on head
(type man head
at a prompt) to see how it
works. head
normally dumps the output to the terminal window; instead, you'll
need to redirect the output to a file. If you don't know how to do that, look up
"redirecting the output" in this awesome UNIX tutorial.
(Again, Windows makes this harder. If you're using a Windows computer, it appears that the current cleanest way to do this is to use some PowerShell commands. If you plan to continue to work extensively with Windows, it would be a worthwhile experience to learn how to use PowerShell. Alternatively, you could do this piece on a department Mac and transfer the file to your computer.)
Load the IMDb data into your map
Create a class called HashSearchEngine
with a main that creates a
HashMap201
object, then loads up a file of movie data into it.
Specifically, the movie titles should be the keys, and the plot summaries should
be the values. Note that this data is messy in a few ways: this is typical for
data that you get from other sources. You'll need to think about what kinds of
processing you may want to do on the data to make the titles usable.
When you use Scanner
to read the file, you'll need to add a special addition to
it. The input file contains a lot of unusual characters, such as diacritical
marks. Some of these odd characters will confuse Scanner
and stop reading the
file prematurely, unless you tell it that this file has an unsual encoding. When
you open up the file, do it using a second parameter to Scanner
like so:
Scanner inp = new Scanner(new File("filename.txt"), "ISO-8859-1");
Since you'll be using more than one input file (the big one, and your small
testing one), don't hardcode the name of your file into your program like
"filename.txt" above. Instead, add it as command-line parameter; use the array
args
in main
to grab the filename and open it appropriately.
- If you are using your own computer and you have a limited amount of memory,
you might run out of memory while loading up the hash table. By default, Java
only allows you to use around 1/4 of your computer's memory before running
out. If you get an out of memory error when trying to run your code, you can
change Java's default in jGRASP by going to the Settings, Compiler Settings,
Workspace screen, going to the "Run" row in the second column (the one
labeled "FLAGS2 or ARGS2"), clicking the square next to it so that you can
change the value, and put in
-Xmx1g
. This tells Java to use 1 GB of memory, even if this is more than 1/4 of the memory that your computer has.
Loading up the data can take a while on the big file. I found it convenient to keep a running counter of how many movies I had loaded, and print out a status update every 1000 movies.
After loading up the data, you should enter a loop that prompts the user to
enter in a movie title. For example, if I enter in Star Wars (1977)
, I
should see the plot summary for that movie. After displaying the plot summary,
prompt the user again for another movie. Do this until the user types ####
, in
which case the program should end. I've checked that no movie titles start with
that prefix.
Evaluation of how well it works
Finally, you should perform some experiments to see how the amount of time it takes to run varies with the size of the main hash array. Try a variety of array sizes, and submit a plot made with the tool of your choice showing how runtime changes with respect to array size.
- To measure runtime, the method
System.nanoTime()
is really useful. You can measure the time on the clock before and after doing something, and then calculate the difference.
Submit a readme.txt
file that contains a brief summary of your findings,
and a PDF with your plot.
What to turn in
Part 1: Turn in your HashMap201
class, which should include a main
method for testing with some small data.
Part 2: Turn in your HashSearchEngine
class, as well as your plot showing
performance for different array sizes.
Acknowledgements
According to IMDb licensing, I should add:
Information courtesy of IMDb (http://www.imdb.com/). Used with permission.