CS 201 Assignment

Work alone or with a partner of your choice. If you would like to find a partner but don't have anyone in mind, post your availability on the #general channel on Slack.

Goals

Show off your mad Java skillz (that's the hip lingo from about 2003, right?)
Use one of your new data structures to solve an interesting problem.
In particular, use a graph and breadth-first search to do something that would take a huge amount of time if you used a more simple-minded approach.

I. Project Option #1: A Word Game Helper

Submit to Moodle as WordGameHelper.java

You'll need a dictionary file (i.e. a list of words you're going to consider legal). Here's a large list of English words.

I like word games. I do the New York Times crossword every day, and I will play Boggle with anybody willing to play with me (and when they're not, I just play alone on my phone). Scrabble, Bananagrams, acrostics, Perquackey, Text Twist, just fooling around with anagrams? Bring 'em on.

Sometimes I play my games strictly—no dictionaries and no web searches allowed. But sometimes it's fun to cheat. What's the best possible Scrabble word I could make out of these seven letters? Can I find the Boggle board with the largest possible number of words on it? What words could go into 17-down in my current crossword puzzle?

If you choose this project option, you will write a program that answers question #1 below and also either question #2 or question #3 (i.e. you only have to implement two of these, but one of them has to be #1):

1. Given two words of identical length, what's the shortest word ladder you can construct between them? (e.g. you can go from HATE to LOVE in three steps: HATE - LATE - LAVE - LOVE, or HATE - HAVE - HOVE - LOVE, etc.).
2. Given a string of letters, find all their one-word anagrams. (e.g. if your string is PART, you'll get PART, TARP, TRAP, and PRAT.)
3. Given a string of letters and asterisks, find all the words of the same length as your search string that match the letters exactly. That is, the asterisks in your search string are a "wildcard" character that means "any letter can go here." For example, if your search string is G**W, you would get GLOW, GNAW, GNOW, GREW, and GROW. (A dictionary tells me that "gnow" is "the Mallee Fowl of Western Australia." So now you gnow.) This feature, by the way, is extremely handy for crossword puzzles.
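
Each of the three questions boils down to a small test on a pair of strings. Here is a minimal sketch of those tests, not a prescribed design: the class name WordChecks and its method names are made up for illustration, and the code assumes every word has already been converted to upper case.

```java
import java.util.Arrays;

// Hypothetical helper class; the names here are illustrative only.
class WordChecks {
    // True if a and b have the same length and differ in exactly one
    // position, i.e. they are adjacent rungs of a word ladder (HATE, LATE).
    static boolean differByOneLetter(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return differences == 1;
    }

    // The letters of the word in sorted order. Two words are anagrams of
    // each other exactly when their keys are equal.
    static String anagramKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // True if word has the same length as pattern and agrees with it at
    // every position where the pattern is not an asterisk.
    static boolean matchesWildcard(String pattern, String word) {
        if (pattern.length() != word.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '*' && p != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, anagramKey("PART") and anagramKey("TRAP") both give "APRT", so grouping dictionary words by key in a map is one way to answer an anagram query with a single lookup.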

Your program's command-line syntax should be:

java WordGameHelper ladder startWord endWord
java WordGameHelper anagram string
java WordGameHelper wildcard string

(only the two that are relevant to your project, of course).
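
One way to handle that syntax is to look at the first argument and the argument count before doing any real work. The sketch below is only an illustration: the class name CommandLineSketch and the returned strings are placeholders I invented, and the real program would call its ladder, anagram, or wildcard code where the placeholders are.

```java
// Hypothetical sketch of argument handling; the real work is left out.
class CommandLineSketch {
    // Returns a description of the requested action, or a usage message
    // if the arguments do not fit any of the three forms.
    static String dispatch(String[] args) {
        if (args.length == 3 && args[0].equals("ladder")) {
            return "ladder from " + args[1] + " to " + args[2];
        }
        if (args.length == 2 && args[0].equals("anagram")) {
            return "anagrams of " + args[1];
        }
        if (args.length == 2 && args[0].equals("wildcard")) {
            return "matches for " + args[1];
        }
        return "Usage: java WordGameHelper ladder startWord endWord"
                + " | anagram string | wildcard string";
    }

    public static void main(String[] args) {
        System.out.println(dispatch(args));
    }
}
```

Checking the argument count up front means a mistyped command produces a usage message rather than an ArrayIndexOutOfBoundsException.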

NOTE: since your Unix shell will interpret * in its own special way, you'll need to put wildcard search strings between single-quotes, like:

java WordGameHelper wildcard 'G**W'

II. Project Option #2: Six Degrees of Mary Pickford

Submit to Moodle as ChainFinder.java

A very long time ago, I started playing around with what my family referred to as the "handshake game." The idea is that you try to construct chains from one person to another, where each link is a pair of people who have met each other (they've shaken hands, at least metaphorically). The example I usually use is this five-step chain from me to Mozart:

I shook hands with my piano teacher Bernhard Weiser, who studied with pianist Carl Friedberg, who took lessons from Clara Schumann, who met Johann Wolfgang von Goethe, who met Wolfgang Amadeus Mozart.

Cool.

This weird game hit the mainstream with the first productions in 1990 of the John Guare play Six Degrees of Separation, followed soon after by the film version. What happened next was silly and hilarious: somebody cooked up the idea of "Six Degrees of Kevin Bacon" to use this same chains-of-people idea, where two actors/actresses would be connected to one another if they both appeared in the same movie. Kevin Bacon was clearly chosen partly because he has acted in many, many movies, but also because "Kevin Bacon" sort of rhymes with "Separation." Want to play with this idea? You're in luck. You can go to The Oracle of Kevin Bacon to discover that Regina King is connected to silent film megastar Mary Pickford like so:

Mary Pickford was in A Little Princess (1917) with
ZaSu Pitts, who was in Paris (1929) with
Jason Robards, who was in Enemy of the State (1998) with
Regina King

(Watch out, though. The Oracle is sometimes a bit too simple-minded. It turns out that Jason Robards, Sr. was in Paris, while his son, Jason Robards, Jr., was the one in Enemy of the State.)

Around this same time, the same idea blossomed into a popular area of study in mathematics and the theory of algorithms, in part due to the emergence of online social networks that have generated gigantic datasets of interconnections between people and organizations.

For this project, you are going to use IMDb's movie data to recreate the Kevin Bacon oracle's main feature: given two movie performers, find the shortest sequence of other performers connecting them, along with the movies shared by successive pairs of performers in the chain.
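
Finding the shortest chain is a breadth-first search from one performer to the other, remembering for each performer who led you there. The following is a rough sketch under some assumptions: the class name ShortestChain is invented, performers are identified by plain strings, and a map from each performer to a list of co-stars stands in for whatever graph class you write. It tracks only the performers, not the movies that link them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch; the adjacency map stands in for a real graph class.
class ShortestChain {
    // Breadth-first search from start to goal. Returns the vertices along
    // one shortest path, or an empty list if goal cannot be reached.
    static List<String> shortestPath(Map<String, List<String>> neighbors,
                                     String start, String goal) {
        // Maps each discovered vertex to the vertex it was reached from.
        Map<String, String> cameFrom = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        cameFrom.put(start, null);
        frontier.add(start);

        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            if (current.equals(goal)) {
                // Walk the predecessor links back to start, then reverse.
                List<String> path = new ArrayList<>();
                for (String v = goal; v != null; v = cameFrom.get(v)) {
                    path.add(v);
                }
                Collections.reverse(path);
                return path;
            }
            List<String> adjacent =
                    neighbors.getOrDefault(current, Collections.emptyList());
            for (String next : adjacent) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    frontier.add(next);
                }
            }
        }
        return new ArrayList<>();
    }
}
```

To report the shared movies as well, one option is to store a movie on each edge and record it alongside the predecessor. The same search works for word ladders if the vertices are words.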

I'm not going to specify a command line syntax for this project, but do make it as simple and intuitive as you can. You might also want to add some features to make the program more convenient to use. For example, you might allow your user to type:

java ChainFinder allmovies 'Ellen Page'

to get a list of all the movies Ellen Page has been in, or:

java ChainFinder cast 'Alien'

to get the cast list for Alien. But at minimum, you'll want to implement something like:

java ChainFinder chain 'Emma Stone' 'Peter Lorre'

to find the shortest path between those two actors.

Data

I grabbed the data made available by IMDb for non-commercial use, and did a little preprocessing on it. The resulting data consists of three comma-separated values files:

- actors.csv, each of whose lines contains:
    unique id,actor name,birth year,death year
  where "death year" is the empty string if the actor is still alive.
- movies.csv, each of whose lines contains:
    unique id,movie title,year
- casts.csv, each of whose lines contains:
    movie id,actor id,list of characters played by this actor in this movie
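
As a sketch of how the casts.csv layout might be read, the code below groups actor ids by movie id. It rests on assumptions of mine: the class name CastLines is invented, ids are treated as plain strings, and the first two fields are assumed to contain no commas. Check the files themselves for a header row or quoted fields before relying on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reading lines in the casts.csv layout.
class CastLines {
    // Groups actor ids by movie id. Each line is assumed to look like
    //   movie id,actor id,characters...
    // Splitting with a limit of 3 keeps any commas in the character list
    // from being treated as field separators.
    static Map<String, List<String>> actorsByMovie(List<String> lines) {
        Map<String, List<String>> result = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",", 3);
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            result.computeIfAbsent(fields[0], id -> new ArrayList<>())
                  .add(fields[1]);
        }
        return result;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines. Every pair of actors listed under the same movie id is then a candidate edge in the graph.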

NOTE: I was unable to find an online service that provides rich movie cast data in an easily downloadable form. IMDb's downloadable data is fine as far as it goes, but they only include the most prominent cast members in each movie. For example, "Star Wars: Episode IV - A New Hope" only includes Mark Hamill, Carrie Fisher, Harrison Ford, and Alec Guinness as cast members. Thus, your graph will be a lot sparser than the graph used by The Oracle of Kevin Bacon, which gets its data by downloading and parsing the entirety of Wikipedia in search of movie casts. Your graph's connections will mean something like "these two actors costarred in a movie" as opposed to "these two actors acted in the same movie". That's OK, but it's important to be aware of before you start testing your program.

ANOTHER NOTE: I used only the English movie titles provided by IMDb. So, for example, Das Leben der Anderen appears in movies.csv as The Lives of Others.

III. What to hand in for either project

A readme file containing:
- A description of your program and its features.
- A description of your program's command-line syntax.
- A description of the main data structures your program uses.
- A discussion of the current status of your program, what works and what doesn't, etc.
Your program's source code.

IV. Constraints

Both of these projects will require you to implement a graph (for the word ladders and the actor chains) on which you will perform breadth-first search. Because this is a project in a course called Data Structures, I expect you to implement your own graph class.

Want to use a list, an array, a stack, a queue, or a map/search-structure? Feel free to use the ones built in to Java.
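
To make the division of labor concrete, here is a minimal sketch of a graph class of your own that leans on Java's built-in map and set. It is one possible shape rather than a required design: the name SimpleGraph, the String vertices, and the method names are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical undirected graph stored as adjacency sets.
class SimpleGraph {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Adds the vertex if it is not already present.
    void addVertex(String vertex) {
        adjacency.computeIfAbsent(vertex, key -> new HashSet<>());
    }

    // Adds an undirected edge, creating either endpoint if needed.
    void addEdge(String a, String b) {
        addVertex(a);
        addVertex(b);
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
    }

    // Returns the vertices adjacent to the given vertex, or an empty set
    // if the vertex is unknown.
    Set<String> neighbors(String vertex) {
        return adjacency.getOrDefault(vertex, Collections.emptySet());
    }

    int vertexCount() {
        return adjacency.size();
    }
}
```

Using sets for the neighbors means adding the same edge twice is harmless, which is convenient when two actors share several movies.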

V. Grading Criteria

Successful compilation. If your program doesn't compile, I can't grade it.
Correctness. Your program needs to do the job it is intended to do.
Design. I will look for well-considered choices of data structures, classes, and method signatures.
Style. I want to see good indentation, descriptive variable and function names, well-placed comments, consistent loop structure, and so on. Ideally, your code will be a pleasure to read. (This assumes a reader who enjoys reading code, but you're in luck, since I am such a reader.)
Documentation. Your description of your project and the comments in your source code are important parts of your project. I want to be able to understand your project fairly well before diving into the code.
Performance. I will not run precise time tests, but I will frown on programs that take 10 minutes to process a single word ladder or pair of actors.

VI. Advice

Plan your program on paper. Do not start your work at the computer. For relatively large programs like these, forethought can save you a lot of time.
Design classes and methods that will make the main program easy to write. You want your methods to provide services to the rest of your code, and good design of those services can make the difference between code that is very straightforward to write and code that acts like a tar pit. Think carefully about your classes. It's worth the time.
Make an incremental development plan. That is, make a list of things you will make your program do, in the order in which you will write the code. Each stage of your development should be a small step that can be compiled and tested (and backed up!) before you move on to the next step.
Ask questions. I will be very happy to help you when you're stuck. Don't wait. You only have two weeks, and there's a take-home exam in there, too.

VII. And one last time...

Start early, ask questions, and have fun!

CS 201: Data Structures

Final project