CS 201: Data Structures

Final Project

Partners: you may work alone or with a partner of your choosing

Project Option #1: A Word Game Helper

I like word games. I do the New York Times crossword every day, and I will play Boggle with anybody willing to play with me (and when they're not, I just play alone on my phone). Scrabble, Bananagrams, acrostics, Perquackey, Text Twist, just fooling around with anagrams? Bring 'em on.

Sometimes I play my games strictly--no dictionaries and no web searches allowed. But sometimes it's fun to cheat. What's the best possible Scrabble word I could make out of these seven letters? Can I find the Boggle board with the largest possible number of words on it? What words could go into 17-down in my current crossword puzzle?

If you choose this project option, you will write a program that answers question #1 below and also either question #2 or question #3 (i.e. you only have to implement two of these, but one of them has to be #1):

  1. Given two words of identical length, what's the shortest word ladder you can construct between them? (e.g. you can go from HATE to LOVE in three steps: HATE - LATE - LAVE - LOVE, or HATE - HAVE - HOVE - LOVE, etc.).
  2. Given a string of letters, find all their one-word anagrams. (e.g. if your string is PART, you'll get PART, TARP, TRAP, and PRAT.)
  3. Given a string of letters and asterisks, find all the words of the same length as your search string that match the letters exactly. That is, the asterisks in your search string are a "wildcard" character that means "any letter can go here." For example, if your search string is G**W, you would get GLOW, GNAW, GNOW, GREW, and GROW. (A dictionary tells me that "gnow" is "the Mallee Fowl of Western Australia." So now you gnow.) This feature, by the way, is extremely handy for crossword puzzles.

This project's command-line syntax should follow this pattern for the two features you implement:

java ladder startWord endWord java anagram string java wildcard string

NOTE: since bash or whatever other Unix shell you're using will interpret * in its own special way, you'll need to put the wildcard search string between single-quotes, like:

java wildcard 'G**W'

Project Option #2: Six Degrees of Mary Pickford

This assignment works fine in principle, but the data files are very large, and most students at the CS201 level aren't quite ready to handle that. In practice, even with strong students, the graph traversal turns out to be the easiest part of this assignment, dwarfed by the problem of getting the data into memory (or doing sufficiently fast searches directly on the data files).

A long time ago, I started playing around with what my family referred to as the "handshake game." The idea is that you try to construct chains from one person to another, where each link is a pair of people who have met each other (they've shaken hands, at least metaphorically). The example I usually use as an example is this five-step chain from me to Mozart: I met my piano teacher Bernard Weiser, who studied with pianist Carl Friedburg, who took lessons from Clara Schumann, who met Goethe, who met Mozart. Cool.

This weird game hit the mainstream with the first productions in 1990 of the John Guare play Six Degrees of Separation, followed soon after by the film version. What happened next was silly and hilarious: somebody cooked up the idea of "Six Degrees of Kevin Bacon" to use this same chains-of-people idea, where two actors/actresses would be connected to one another if they both appeared in the same movie. Kevin Bacon was clearly chosen partly because he has acted in many, many movies, but also because "Kevin Bacon" sort of rhymes with "Separation." Want to play with this idea? You're in luck. You can go to The Oracle of Kevin Bacon to discover that Jennifer Lawrence is connected to silent film megastar Mary Pickford like so:

Mary Pickford was in The Poor Little Rich Girl (1917) with
Maxine Elliot Hicks, who was in Beethoven (1992) with
Oliver Platt, who was in X-Men: First Class (2011) with
Jennifer Lawrence

Around this same time, this same idea blossomed into a popular area of study in mathematics and the theory of algorithms, in part due to the emergence of online social networks that have generated gigantic datasets of interconnections between people and organizations.

For this project, you are going to use IMDB's movie data to recreate the Kevin Bacon oracle's main feature: given two movie performers, find the shortest sequence of other performers connecting them, along with the movies shared by successive pairs of performers in the chain.

I'm not going to specify a command line syntax for this project, but do make it as simple and intuitive as you can. You might also want to add some features to make the program more convenient to use. For example, you might allow your user to type:

java ChainFinder allmovies 'Ellen Page'

to get a list of all the movies Ellen Page has been in, or:

java ChainFinder cast 'Alien'

to get the cast list for Alien. But at minimum, you'll want to implement something like:

java ChainFinder chain 'Emma Stone' 'Peter Lorre'

to find the shortest path between those two actors.

Where to find the data? If you go to http://www.imdb.com/interfaces and select one of the links there (I found the German one to be fastest), you'll be able to download actors.list.gz, actresses.list.gz, and movies.list.gz to get the data you need. Note that one of the tricky parts of this project will be figuring out how to parse the data from these files into suitable Java objects at run-time.

What to hand in

Constraints

Both of these projects will require you to implement a graph (for the word ladders and the performer chains) on which you will peform breadth-first search. Because this is a project in a course called Data Structures, I expect you to implement your own graph class.

Want to use a list, an array, a stack, a queue, or a map/search-structure? Feel free to use the ones built in to Java.

Grading Criteria

Advice

And one more time...

Start early, ask questions, and have fun!