CS 201: Search Engine

Search Engine

Overview

This is a team assignment.

For this assignment, we'll make a rudimentary search engine. Specifically, we'll load up all of the titles and plots from the Internet Movie Database (IMDB) into a map implemented as a binary search tree, and then allow the user to request titles and plots for all movies that start with a particular prefix.

Part 1: The Map

Create a class called TreeMapForStrings that will be used to map Strings to Strings. (The intention is to use it to map movie titles to movie plots.) Implement this map as a binary search tree. For this assignment, you should not use the built-in Java classes that do this task such as TreeMap, HashMap, TreeSet, or HashSet. Your code should have the following methods:

void put(String key, String value): add a key and a value to your tree. If the key is already present in the tree, replace the old value with the new one.
String get(String key): get the value associated with a particular key. get should return null if the word is not in the tree.

Make sure to test your map appropriately. A main method would be a nice way to do this.

Part 2: More Map

Add this method to your TreeMapforStrings class:

ArrayList<String> getKeysForPrefix(String prefix): return a list of all keys that start with the prefix. You should do this efficiently: specifically, you should not traverse the entire tree in order to find these keys.

Again, make sure to submit code to test your map appropriately.

Part 3: Filling up the map with title and plot data from IMDB, and running queries

About IMDB data

IMDB makes plain text files of all of its data available for download. I've grabbed a file called plot.list that contains titles and plot summaries for all of the movies in the database. While plot.list is the original file from IMDB, it has one major problem for this assignment: it is sorted by movie title, which works poorly for creating a binary search tree from scratch. Therefore, you should instead use plotunsorted.list, which I have created by randomly ordering the movies.

How to get the data

Because the file is so large (119 MB compressed, 318 MB uncompressed), I've had to place them in a shared CS department directory with enough disk space (that location is called "/Accounts/courses/"). That directory isn't accessible via the web, so you'll find directions below on how to obtain the file you need. But first: do not copy this file to any of your network-shared Carleton directories. If 35 students each copy and uncompress a (300 MB) file to their own network storage directory, that's a total of 10 gigabtes of network storage that we're wasting merely for everyone to have their own copy. Moreover, you'll be fighting with all the other students in the class to read the file over the network every time you read it in.

Also, in case you were thinking about being clever about saving disk space (which is often awesome), do not direct the program you write to read the file directly from the /Accounts/courses location without copying it first. If 35 students in the class repeatedly try to read a very large file at the same time, we may clog the network and bring it to a halt. This is the same problem we face if you put the file in your home directory (in addition to storage space issues).

So what should you do? If you're working on a lab machine (either native or in a VM), make a copy of the file to the directory /tmp. /tmp is a local directory on the computer that you're using. The file might get deleted on a reboot or a logout, but you can always recopy the file over if it vanishes. If you're using your own computer, you can put the file anywhere you want on your own computer. Make sure, however, that you are storing it locally and not in some network location.

(One minor source of irritation you may run into is that if someone who used the computer before you has already copied this file onto /tmp, you may not have permission to read the file. If that happens, just copy it again with a different file name. Alternatively, reboot the computer, and it will hopefully disappear.)

There are a variety of ways of copying the file, but here is one that will work under Mac or Linux: use "scp", which is a command-line program for doing a "secure copy." At a command-prompt, you can learn more about how it works by typing:

  man scp

Specifically, to grab the file you need, first use the cd command to change to the directory you want to put the file in. For example, if it's a lab machine, type

  cd /tmp

Then issue the following command, which securely copies the file from the remote location:

  scp username@skittles.mathcs.carleton.edu:/Accounts/courses/cs201/dmusican/plotunsorted.zip .

where username is your username. Don't miss the dot at the very end; that means copy the file into your current directory. skittles is the name of a CS department server that you're copying the file from. You'll need to enter your Carleton password once you issue the above command, and the file should start coming down. Once you've got it, unzip it, and you should be ready to go.

Windows computers frustratingly don't have scp or a similar program installed. If you're using Windows, you can install WinSCP, which should do the same thing, though you'll have to figure out how to configure it; I haven't played with it. If someone figures out how to do this, perhaps they could post detailed directions to Piazza.

Make a smaller version

For testing your program, you should make a small version of this file where you cut out most of the movies. Reading in the entire plotunsorted.list file takes a long time, and if you read from that file every time you debug your program, you'll add many hours to your coding and debugging waiting for it every time. This is a key general strategy that is really useful when working with big data; make a small version of the data to use for testing. Opening up a file this big in a text editor would likely be exceedingly slow and would use a huge amount of memory. Instead, try to use the UNIX command head which works on both Mac and Linux. (Again, Windows makes this harder.) Read the man page on head (type man head at a prompt) to see how it works. head normally dumps the output to the terminal window; instead, you'll need to redirect the output to a file. If you don't know how to do that, look up "redirecting the output" in this awesome UNIX tutorial.

Loading the IMDB data into your tree

Create a class called SearchEngine with a main that creates a TreeMapForStrings object, then loads up a file of movie data into it. Specifically, the movie titles should be the keys, and the plot summaries should be the values. Note that this data is messy in a few ways: this is typical for data that you get from other sources. You'll need to think about what kinds of processing you may want to do on the data to make the titles usable.

When you use Scanner to read the file, you'll need to add a special addition to it. The input file contains a lot of unusual characters, such as diacritical marks. Some of these odd characters will confuse Scanner and stop reading the file prematurely, unless you tell it that this file has an unsual encoding. When you open up the file, do it using a second parameter to Scanner like so:

  Scanner inp = new Scanner(new File("filename.txt"), "ISO-8859-1");

Since you'll be using more than one input file (the big one, and your small testing one), don't hardcode the name of your file into your program like "filename.txt" above. Instead, add it as command-line parameter. When you run your program, do so like this:

java -Xmx1g SearchEngine plotunsorted.list

java -Xmx1g SearchEngine plotunsortedsmall.list

Use the array of args in main to grab the filename and open it appropriately.

The "-Xmx1g" tells Java to allow a maximum about of memory of 1 gigabyte, which you'll need for loading up this much data. Without doing so, the default maximum amount of memory that Java uses is something like 128 MB or 256 MB (depending on the version of Java).

Loading up the data can take a while on the big file. I found it convenient to keep a running counter of how many movies I had loaded, and printing out a status update every 1000 movies.

After loading up the data, you should enter a loop that prompts the user to enter in a movie title prefix. For example, if I enter in Flash, I should see the titles and plot summaries for all movies whose titles start with the characters Flash. Happily, if you've completed part 1, you've got a method that helps you out directly in doing this. After displaying the titles and plot summaries, prompt the user again for another prefix. Do this until the user types ####, in which case the program should end. I've checked that no movie titles start with that prefix.

Timing issues

Finally, in a readme.txt file that you submit along with your code, answer the following question:

Loading up the entire movie database is slow. Is that because reading the file itself is slow, or is because adding to the tree is slow? Experimentally determine which is the bottleneck, and explain your results. You may find the method System.currentTimeMillis() very useful here.

What to turn in

Part 1: Turn in your TreeMapForStrings class, which should include a main method for testing it with some small data.

Part 2: Turn in your updated TreeMapForStrings class, which should include a main method for testing it with some small data.

Part 3: Resubmit Part 2, as well as your SearchEngine class and your readme.txt.

Have fun, and good luck!

According to IMDB licensing, I should add:

Information courtesy of IMDb (http://www.imdb.com). Used with permission.