CS 201: Data Structures (Spring 2018)
HW05: Path finder
Due: Monday, 04/23 at 22:00
1. Goals
To apply your understanding of the Graph ADT and graph algorithms by implementing an algorithm to find shortest paths on Wikipedia data.
2. Setup
This is a pair programming assignment. You will continue with your same partner (unless I have told you otherwise).
In this assignment you will write a PathFinder
class that loads
data into a graph and can find and display a shortest path between two
vertices in the graph.
The code that you write for this assignment will build on top of a variety
of ADT interfaces and implementations. I've included the Graph ADT and a graph
implementation that we've talked about in class. For the graph implementation,
you do not need to look at any of the source code. Instead, you can look
at the javadocs to see how to construct graphs and what methods exist. Just as
you don't need to look at the code for CarlStack
to use a
CarlStack
, you don't need to look at or understand the
CarlUnweightedGraph
implementation to use it, and you may not
modify the graph implementations. Download the zip file and unzip its
contents into the same directory where you write your code for this assignment:
You may also find the documentation useful:
You are welcome to use any other data structures we have talked about as you
design your PathFinder
.
The hand-in for this assignment will be three files:
PathFinder.java
and two files for a sample graph (described more
below).
3. Specification
What your program should do
You may have played a game where starting from some Wikipedia page, you try to get to another Wikipedia page in as few pages as possible (there's even a Wikipedia page about the idea). As we've talked about in class, you can think about the web as a directed graph, where a link from one page to another corresponds to an edge. The pages and links between pages in Wikipedia correspond to a subgraph of the graph of the entire web, and the goal of this assignment is to find a shortest path between two vertices in that subgraph.
You'll write your code in the PathFinder
class. Usage for
PathFinder
should be:
java PathFinder vertexFile edgeFile
where vertexFile
is the file that has the vertex names for your
graph and edgeFile
is the file that has the edges for your graph.
We'll describe the format of those files in a minute. PathFinder
should load the data, and then choose two random vertices, a start vertex and
an end vertex. The program should find a shortest path from the start vertex to
the end vertex and then display the length of the path you found and the
complete path.
For example, if the articles randomly chosen were Pythagorean_theorem and Pride_and_Prejudice, your program might output:
Shortest path from Pythagorean_theorem to Pride_and_Prejudice, length = 4 Pythagorean_theorem --> British_Isles --> United_Kingdom --> Jane_Austen --> Pride_and_Prejudice
Note that the length of the path is the number of edges. If there is no path between the two articles (which can happen!), the program should still indicate which articles were chosen and that no path was found.
Inside the PathFinder
class, you must have the following constructor and methods:
/** Constructs a PathFinder that represents the graph with vertices specified as in vertexFile and edges specified as in edgeFile. @param vertexFile Name of the file with the vertex names. @param edgeFile Name of the file with the edge names. */ public PathFinder(String vertexFile, String edgeFile); /** Returns the length of a shortest path from vertex1 to vertex2. If no path exists, returns -1. If the two vertices are the same, the path length is 0. @param vertex1 Name of the starting article vertex. @param vertex2 Name of the ending article vertex. @return Length of shortest path. */ public int getShortestPathLength(String vertex1, String vertex2); /** Returns a shortest path from vertex1 to vertex2, represented as list that has vertex1 at position 0, vertex2 in the final position, and the names of each vertex on the path (in order) in between. If the two vertices are the same, then the "path" is just a single vertex. If no path exists, returns an empty list. @param vertex1 Name of the starting article vertex. @param vertex2 Name of the ending article vertex. @return List of the names of vertices on a shortest path. */ public List<String> getShortestPath(String vertex1, String vertex2);
Please make sure you have methods that exactly match these signatures. Often, users of a class and designers of a class will agree ahead of time on what methods to have. This allows the users of a class to start writing their code that will use the class even before the class has been written. In the case of your homework, having these methods allows us to write code that will test your methods.
Beyond having those methods above, you can design PathFinder
however you like. As you are designing your class, think about what ADTs make
the most sense to use for what purposes. Make sure to use an object-oriented
design and follow the style guide.
Data
In the zip file for this assignment, you'll see two data files:
articles.tsv
and links.tsv
. If you look at each file,
you'll see that articles.tsv
has Wikipedia articles, one on each
line, and links.tsv
has links from one article to another. For
example,
10th_century Solar_System
means there is a link from the article 10th_century to the article
Solar_System. The two article names are separated by a tab
character, which can be represented as \t
in Java. The articles
and links are a subset
of Wikipedia from 2007. Your program should be able to load files in this
form. Take a minute to look at the file formats: pay attention to what
sorts of lines indicate comments (which shouldn't be a part of the graph!).
You'll also see that the article names have been "URL-encoded": special
characters are encoded so they can be typed in as URLs. The comments tell you
how to decode these article names so you can display them properly,
e.g.,
String humanReadableArticleName = java.net.URLDecoder.decode(articleName, "UTF-8");
Take a look at the documentation if you want to learn more about encoding and decoding URLs. You should display the decoded strings when showing the path. You do not need to make any other changes to the names: underscores, for instance, should be left as underscores rather than changed into spaces; occasionally a character will still appear oddly. (The latter is due to whatever character set your command-line program is using to display.)
While looking at the Wikipedia data is fun, it can be hard to debug your
program using this large dataset. Thus, you should also make one or more small
datasets of your own so you can check that your program correctly finds
shortest paths. You will probably make multiple test data sets, but at a
minimum, you should make one with files testVertices.txt
and
testEdges.txt
. These should have the same format as the Wikipedia
datasets. In the comments in the testVertices.txt
file, you should
include something about what the graph is that you're making and what values
you'd expect for particular pairs of vertices. (Then verify that you get the
correct results when testing your program with these small graphs.) Note that
if it's helpful for debugging, you can augment your usage cases so that the
third and fourth command line arguments are the start and end vertices for the
path you want to find. This isn't required, but might be very helpful!
4. Code notes
- Just like for the
ZooDisplayer
project, think of breaking this assignment into parts with an incremental development plan. For example, I would first write a load method and test that. Then I'd write a method to find the path between two particular vertices. Then I'd figure out how to choose two vertices at random, and so on. Take things in small bits, and test as you go. - While the "big picture" goal of this assignment is to use a graph with Wikipedia data, you should be testing your path finding algorithm on smaller graphs where it's easier to verify that the algorithm performs correctly. Make sure to test a variety of graph types and think about edge cases: empty graphs, complete graphs, graphs where some pairs of vertices don't have a path between them, and so on.
- In class, we talked about the fact that a shortest path can be found using breadth-first search. Write your own code. You may refer to the algorithm written as pseudocode in Segment 28.19 in the textbook, but you should refrain from viewing (a fortiori copying) other code.
- We also brainstormed some challenges in class: storing the names that go with each vertex and keeping track of the predecessor of a vertex when you visit it. Keeping track of the predecessor of a vertex will allow you to trace back the path to your start vertex when you finally find the vertex you're looking for. What data structures have we learned about in class that can help you solve these challenges?
- There's a lot of flexibility in this assignment, which can be fun or daunting. It's very important that you get started early, start off by thinking about the big picture design, and that you come ask questions if you run into troubles. If you get stuck, it's much more productive to take a break and ask a question the next day in office hours than it is to stay up all night beating your head against the wall.
- (Not so much a code note, but...) If you're curious why anyone would want to know the shortest paths between pages in Wikipedia, you might be interested in looking up the papers listed in the comments on the Wikipedia dataset. The idea of using graph metrics is popular for understanding lots of big datasets, and for concept similarity, has been around since at least 1989 when Rada et al. [PDF] proposed using shortest path to measure distances in meaning between words.
- Optional extra challenge:
Add a second usage case:
java PathFinder vertexFile edgeFile --three
This usage behaves exactly like the first one except your program chooses three random vertices and finds a path from the first vertex to the second vertex that goes through the third vertex. This path is a shortest path given the constraint that it must pass through the third vertex (i.e., it may very well be longer than the actual shortest path if you did not have to pass through the third vertex). The main thing you should focus on here is minimizing code duplication.
5. Submission and grading
Submit to Moodle azip
archive containing:
PathFinder.java
,- the two test files for your test graph.
Assignment requirements
This is a partial list of the things that we'll be looking for when evaluating your work; as with all assignments, we're also looking for good style:
- Program uses an appropriate algorithm to find a shortest path between two vertices.
PathFinder
follows an object-oriented design pattern.PathFinder
has constructors/methods that match exactly the signatures defined above.- The required path-finding/path length methods return correct paths and lengths.
- ADTs are used appropriately.
- Use the Graph interfaces and implementation that I provided without alteration.
- Usage matches that described in the assignment and the program displays a shortest path between two vertices (or indicates there isn't one).
- Test graph files are present and you've correctly indicated some of the path lengths in your comments.
Grading
- Assignment requirements [20 points]
- Comments, style, and design [5 points]
This assignment modified from one designed by Anna Rafferty.
Start early, have fun, and discuss questions on Moodle.