CS 201: Data Structures (Winter 2017)

HW07: Sorting

Due: Wednesday, 02/08 at 22:00

This is a pair programming assignment. You have new partners. Please make sure you know who your partners are.

1. Goals

To study the performance characteristics of several sorting algorithms, and hone your analysis skills.

2. Overview

In this assignment, you'll be doing a mixture of coding, designing and performing timing experiments, and analyzing your experimental results. You'll be thinking through what timing data makes sense to collect, and writing your reasoning in a short report.

This assignment is intended to provide you with a Quantitative Reasoning Encounter (QRE). Keeping this in mind should help you understand the purpose of this assignment and what we are looking for you to do.

You'll be investigating three separate questions over the course of the assignment.

You can download the starter files.

3. Investigation A: radix sort with strings

In this section, you'll be writing a radix sort for sorting words and exploring two versions:

A standard least-significant-"digit" radix sort, like we talked about in class.
A sort where you first divide into buckets based on the first letter ("most significant digit" or msd) and then use standard radix sort to sort the contents of each individual bucket.

For simplicity, you should assume your words are lowercase and contain only the 26 characters of the alphabet. For example, here's a correctly sorted ordering of four words:

go, goat, goatherd, grab

Radix sort implementations

Create a file RadixSort.java that includes at least two methods:

/** Uses standard least significant digit radix sort to sort the words.
    @param A list of words. Each word contains only lower case letters in a-z.
    @return A list with the same words as the input argument, in sorted order.
*/
public static List<String> radixSort(List<String> words);

/** A variation on radix sort that sorts the words into buckets by their
    initial letter, and then uses standard radix sort to separately sort each
    of the individual buckets. Recombines at the end to get a fully sorted
    list.
    @param A list of words. Each word contains only lower case letters in a-z.
    @return A list with the same words as the input argument, in sorted order.
*/
public static List<String> msdRadixSort(List<String> words);

We're using static methods here because the main purpose of this class will be to do some timing experiments, not for use by others. You can have any helper methods you like.

Radix sort investigation

Once you've implemented the two radix sorts and tested them to ensure they correctly sort words, investigate whether one of them is faster than the other. I've provided a words.txt file that you can use to have a big collection of words to sort (although you'll need to lowercase the words when you load them).
This file is currently in sorted order so you'll need to scramble the list before sorting; I recommend using the shuffle method in the Collections class.
You're welcome to use variations on this file (e.g., if you want to compare how performance varies for a somewhat smaller number of words versus a larger number of words). However, if you go this route, you'll want to think carefully about how your choices for creating the smaller file may influence your sorting performance and why.
In the main method of your class, include code that generates the timing data. That is, your code should load words, sort them and time the sorting, and then print out the results. If you include multiple sizes of word files, code for timing each of them should be included. Think of this as "reproducible science": anyone could look at your code to see exactly how you collected your data, and they could even try it themselves to see if their computer yields results in the same trends that you see. Take a look in the main method of QuickSort.java for an example of how to collect timing data.
Make sure that you time only the sorting (e.g., not the file loading)! Do not include any print statements in the region where you're timing, as printing takes an absurdly long time.
The first section of your report for this assignment should describe what differences you found between how long it takes to sort words using radixSort versus msdRadixSort. You should describe how you collected the data and present the data in a table or a graph (as in plot, not the kind with vertices and edges). You should also explain any hypotheses you have about why you see the differences in timing that you see, and whether you think there are particular kinds of data where the difference between the two would be smaller or larger (e.g., data where all of the "words" are strings of a's of varying length or very small data or ...). Justify your predictions and hypotheses using your timing data and/or your analysis of the algorithms. You're welcome to collect additional evidence beyond timing to support your claims (e.g., counting how many words are in each of the buckets in the msd sort).

4. Investigation B: quick sort and small lists

Typically, quick sort implementations switch to using a different sort for small arrays. This is because the overhead of the recursive calls and the constant time factors in quick sort are worse than the poorer asymptotic performance of the other sorting algorithm when $n$ is small. In your book, the sorting algorithm that is used for these small arrays is insertion sort.
I've provided you with a version of the book's quick sort code in QuickSort.java. Your job is to investigate how the performance of quick sort varies when this switch to insertion sort occurs. Write code in main that performs experiments varying when this switch to insertion sort occurs and collecting timing data. I've included an example of how to collect timing data. You may want to use one array size to compare a broad range of possible switch points, and then use a larger array to gather more fine-grained data differentiating a smaller range of switch points.
The second section of your report for this assignment should report on your investigation of quick sort, relating the size where it switches to insertion sort to quick sort's overall performance. Describe what data you collected and why it was appropriate, and present the data in a table or a graph. Explain any patterns you see in your data and any hypotheses you have for why these patterns occur.

5. Investigation C: Shell sort and gap distance

Shell sort performs a sequence of insertion sorts at varying gaps, starting with something large and eventually ending with a gap of one. Any sequence that contains a gap of one will result in a correct sort, but different gap sequences vary significantly in their performance.
A common gap sequence involves creating a sequence where each element differs from the last by some multiplicative factor $m$ plus one. For example, if $m$ is equal to $2$, then the gap sequence is $1$, $2\times1+1=3$, $2\times3+1=7$, $2\times7+1=15$, $\ldots$, where the $(i+1)$st element is equal to two times the $i$th element plus one. To use this gap sequence, you'd start with the largest element that is smaller than the length of the array you're sorting, and work backwards until you hit $1$.
Explore how changing the value of $m$ affects the performance of Shell sort. Specifically, you should try out gaps of $2$, $2.25$, and additional values of your choice to explore the pattern that arises as the gap changes. I'm including $2.25$ as that is a value that is used in many actual implementations of Shell sort; this is known as Tokuda's sequence for its inventor Naoyuki Tokuda. Since $2.25$ is not an integer, instead of using this sequence $1, 3.25, 8.3125, 19.703125, 45.33203125, \ldots$ as gaps, you should round up to the nearest integer and use $1, 4, 9, 20, 46, \ldots.$ You will find the ceil function helpful.
Include your timing experiments in the main method of ShellSort.java.
In the third and final section of your report, describe what you find about how the performance of Shell sort changes as the multiplicative factor used for the gap changes. For this section, you may have less clear hypotheses about why you see the performance you do and that's okay; do try to articulate what factors you believe may be leading to differences in timing based on what you know about Shell sort and why its performance can be better than insertion sort.

6. Report guidelines

Your report should describe the timing experiments you performed, the timing data that you collect, and answers to the questions posed in the three individual investigations. You should make sure to explain how the evidence you collected supports your claims, and explain your choices about what evidence to collect for each question. Your report should include tables and/or graphs to support your points.
Bring the quantative reasoning skills you've learned elsewhere to this assignment. I encourage you to think carefully about how to connect evidence and explanations. Imagine your report is intended to help some people make decisions related to these three investigations (e.g., when to switch to insertion sort for small lists). What information do they need to know? Why do you think the timing data looks the way it does based on what you know about how the algorithms work? You should try to design your experiments in a way that will be convincing to the reader. How could you compensate for the fact that timing data is "noisy," with the same code sometimes taking a little longer or shorter to run? You and/or your partner are welcome to chat with me about your approach in office hours.
Your report should be a PDF file no longer than 5 pages, including any charts and graphs. Note that no format other than PDF is acceptable.

7. Notes and tips

You may change the starter code I gave you in any way you please, although you should ensure you don't break the correctness of any of the sorts.
In your RadixSort class, please make sure the methods are named correctly and have the requested functionality. This allows us to use testing scripts on your code.
Make sure your timing code times only the thing you actually want to time, i.e., the sorting. It shouldn't include file loading or printing to the console, as both of these can take long and variable amounts of time that may swamp true differences in execution time.
You can use many different array sizes to explore the questions in this assignment. Make sure the arrays you're choosing are big enough to show real differences. If all of your experiments take only 1 or 2 milliseconds, it will be very hard to understand how your variations are related to differences in timing. Instead, use an array that's big enough to take at least hundreds of milliseconds to sort.
Think carefully about what evidence to collect, and explain your hypotheses about why you see the data that you do. Any time you make an assertion, think about how to support it with evidence from your timing or from the way the code works.
For the first investigation, you should be able to clearly explain why you see the evidence you see and relate it to the details of your implementations. For the second investigation, you may have less clear hypotheses about why you see the precise data that you do, but you should be able to explain the overall pattern and why it makes sense. For the third investigation, you may have less clear hypotheses about why you see a particular pattern and should instead focus simply on presenting the data and articulating any hypotheses that you do have.
If you're working with a partner, you should both contribute to all parts of the assignment: experimenting and writing things up. Your writing of the radix sort implementation should be done in standard paired programming style: both of you at the computer together, taking turns driving. Similarly, your timing code should also be done in paired programming style, and you should collaboratively decide what timing data to collect; either of you should be able to explain the results that you got from your timing experiments. You each may write drafts of parts of the writeup by yourselves, but you should edit each other's drafts and both of you should be aware of everything that's in the final, combined writeup; if you aren't able to explain why a particular section of the report includes the content it does and what the content means, then you have violated the collaboration guidelines for this assignment. If you're struggling with how to write a report like this, you might go to the Writing Center, located in the library.

8. Submission and grading

Submit to Moodle a zip archive containing:

RadixSort.java, with the methods defined above and main running your timing experiments
QuickSort.java, with main running your timing experiments
ShellSort.java, with main running your timing experiments
Your report, in PDF format, no longer than 5 pages including tables and graphs.

Grading

Radix sort implementations and timing code [9 points]
Thoroughness of investigation and clarity and argumentation of writeup, including speaking to the distinct questions in each section, making clear why your evidence supports the claims that you say it does, and relating back to what we've learned in class/reading when relevant [16 points]

This assignment modified from one designed by Anna Rafferty.

Start early, have fun, and discuss questions on Moodle.