CS 201: Data Structures (Winter 2017)

HW05: Path finder

Due: Monday, 01/30 at 22:00

This is a pair programming assignment. You will continue with your same partner (unless I have told you otherwise). We will switch partners after this assignment.

1. Goals

To apply your understanding of the Graph ADT and graph algorithms by implementing an algorithm to find shortest paths on Wikipedia data.

2. Setup

In this assignment you will write a PathFinder class that loads data into a graph and can find and display a shortest path between two vertices in the graph.

The code that you write for this assignment will build on top of a variety of ADT interfaces and implementations. I've included the Graph ADT and a graph implementation that we've talked about in class. For the graph implementation, you do not need to look at any of the source code. Instead, you can look at the javadocs to see how to construct graphs and what methods exist. Just as you don't need to look at the code for CarlStack to use a CarlStack, you don't need to look at or understand the CarlUnweightedGraph implementation to use it, and you may not modify the graph implementations. Download the zip file and unzip its contents into the same directory where you write your code for this assignment:

You may also find the documentation useful:

You are welcome to use any other data structures we have talked about as you design your PathFinder.

The hand-in for this assignment will be three files: PathFinder.java and two files for a sample graph (described more below).

3. Specification

What your program should do

You may have played a game where starting from some Wikipedia page, you try to get to another Wikipedia page in as few pages as possible (there's even a Wikipedia page about the idea). As we've talked about in class, you can think about the web as a directed graph, where a link from one page to another corresponds to an edge. The pages and links between pages in Wikipedia correspond to a subgraph of the graph of the entire web, and the goal of this assignment is to find a shortest path between two vertices in that subgraph.

You'll write your code in the PathFinder class. Usage for PathFinder should be:

java PathFinder vertexFile edgeFile
where vertexFile is the file that has the vertex names for your graph and edgeFile is the file that has the edges for your graph. We'll describe the format of those files in a minute. PathFinder should load the data, and then choose two random vertices, a start vertex and an end vertex. The program should find a shortest path from the start vertex to the end vertex and then display the length of the path you found and the complete path.

For example, if the articles randomly chosen were Star_Trek and Lyme_disease, your program might output:

Path from Star_Trek to Lyme_disease, length = 4:
Star_Trek --> Earth --> Animal --> Bacteria --> Lyme_disease
Note that the length of the path is the number of edges. If there is no path between the two articles (which can happen!), the program should still indicate which articles were chosen and that no path was found.

The second usage case is:

java PathFinder vertexFile edgeFile --three
This usage behaves exactly like the first one except your program chooses three random vertices and finds a path from the first vertex to the second vertex that goes through the third vertex. This path is a shortest path given the constraint that it must pass through the third vertex (i.e., it may very well be longer than the actual shortest path if you did not have to pass through the third vertex).

Inside the PathFinder class, you must have the following constructor and methods:

/** Constructs a PathFinder that represents the graph with vertices
    specified as in vertexFile and edges specified as in edgeFile.
    @param vertexFile Name of the file with the vertex names.
    @param edgeFile Name of the file with the edge names.
*/
public PathFinder(String vertexFile, String edgeFile);
 
/** Returns the length of a shortest path from vertex1 to vertex2.  If no
    path exists, returns -1.  If the two vertices are the same, the path
    length is 0.
    @param vertex1 Name of the starting article vertex.
    @param vertex2 Name of the ending article vertex.
    @return Length of shortest path.
*/
public int getShortestPathLength(String vertex1, String vertex2);
 
/** Returns a shortest path from vertex1 to vertex2, represented as list
    that has vertex1 at position 0, vertex2 in the final position, and the
    names of each vertex on the path (in order) in between.  If the two
    vertices are the same, then the "path" is just a single vertex.  If no
    path exists, returns an empty list.
    @param vertex1 Name of the starting article vertex.
    @param vertex2 Name of the ending article vertex.
    @return List of the names of vertices on a shortest path.
*/
public List<String> getShortestPath(String vertex1, String vertex2);
 
/** Returns a shortest path from vertex1 to vertex2 that includes the vertex
    intermediatevertex, with vertex1 at position 0, vertex2 in the final
    position.  This may not be the absolute shortest path between vertex1
    and vertex2, but should be a shortest path given the constraint that
    intermediatevertex appears in the path.  If all three vertices are the
    same, the "path" is just a single vertex.  If no path exists, returns an
    empty list.
    @param vertex1            Name of the starting article vertex.
    @param intermediatevertex Name of the intermediate article vertex.
    @param vertex2            Name of the ending article vertex.
    @return List of the names of vertices on a shortest path.
*/
public List<String> getShortestPath(String vertex1, String intermediatevertex, String vertex2);

Please make sure you have methods that exactly match these signatures. Often, users of a class and designers of a class will agree ahead of time on what methods to have. This allows the users of a class to start writing their code that will use the class even before the class has been written. In the case of your homework, having these methods allows us to write code that will test your methods.

Beyond having those three methods, you can design PathFinder however you like. As you're designing your class, think about what ADTs make the most sense to use for what purposes. Make sure to use an object-oriented design and follow the style guide.

Data

In the zip file for this assignment, you'll see two data files: articles.tsv and links.tsv. If you look at each file, you'll see that articles.tsv has Wikipedia articles, one on each line, and links.tsv has links from one article to another. For example,

10th_century	Solar_System
means there is a link from the article 10th_century to the article Solar_System. The two article names are separated by a tab character, which can be represented as \t in Java. The articles and links are a subset of Wikipedia from 2007. Your program should be able to load files in this form. Take a minute to read look at the file formats: pay attention to what sorts of lines indicate comments (which shouldn't be a part of the graph!). You'll also see that the article names have been "URL-encoded": special characters are encoded so they can be typed in as URLs. The comments tell you how to decode these article names so you can display them properly, e.g.,
String humanReadableArticleName = java.net.URLDecoder.decode(articleName, "UTF-8")
Take a look at the documentation if you want to learn more about encoding and decoding URLs. You should display the decoded strings when showing the path. You do not need to make any other changes to the names: underscores, for instance, should be left as underscores rather than changed into spaces; occasionally a character will still appear oddly. (The latter is due to whatever character set your command-line program is using to display.)

While looking at the Wikipedia data is fun, it can be hard to debug your program using this large dataset. Thus, you should also make one or more small datasets of your own so you can check that your program correctly finds shortest paths. You will probably make multiple test data sets, but at a minimum, you should make one with files testVertices.txt and testEdges.txt. These should have the same format as the Wikipedia datasets. In the comments in the testVertices.txt file, you should include something about what the graph is that you're making and what values you'd expect for particular pairs of vertices. (Then verify that you get the correct results when testing your program with these small graphs.) Note that if it's helpful for debugging, you can augment your usage cases so that the third and fourth command line arguments are the start and end vertices for the path you want to find. This isn't required, but might be very helpful!

4. Code notes

5. Submission and grading

Submit to Moodle a zip archive containing: Follow the previously established guidelines, in particular, do not submit any starter files (Wikipedia dataset, Graph interface/implementation).

Assignment requirements

This is a partial list of the things that we'll be looking for when evaluating your work; as with all assignments, we're also looking for good style:

Grading

This assignment modified from one designed by Anna Rafferty.

Start early, have fun, and discuss questions on Moodle.