CS 257: Software Design

Books: test-driven development, command-line arguments, and CSV

Folder name (in your git repository): books

I assigned you a new random partner Tuesday, 9/21 (check Slack #announcements).

Table of contents

This project involves several due dates for subtasks.

Goals

Background: data and command-line interfaces

The data: books, authors, and comma-separated values

Take a look at books1.csv, a file full of data about books and authors. There are only a few dozen books in this dataset, so you wouldn't want to base an important book-related application on it. But for learning about how to manipulate datasets like this, a couple dozen books will be plenty.

Note, by the way, that when you look at a CSV file on github.com, the GitHub user interface renders the file in pretty columns, but the file itself is just text and doesn't look pretty when you open it with vim or Atom or whatever. To get your hands on files that are stored in my GitHub repo for this class, you should just clone it:

git clone https://github.com/ondich/cs257_2021_fall jeffs_repo
or something like that. You'll want to go into your clone of my repo periodically and do a "git pull" since I'll be putting all my samples in there.

Back to books1.csv. The file's format is known as comma-separated values (CSV). CSV is a very simple format used to store tables of textual data. Each line of text represents a row in the table, and the fields/columns in each row are separated by commas. These few lines illustrate the principle: "title,publication year,author"

Jane Eyre,1847,Charlotte Brontë (1816-1855)
To Say Nothing of the Dog,1997,Connie Willis (1945-)
The Stone Sky,2017,N.K. Jemisin (1972-)

The only thing that makes CSV at all tricky is when the data in one of the table cells contains either a comma or a newline character. For example, consider the novel "Right Ho, Jeeves" by P.G. Wodehouse. If you just comma-separate the fields, you get this:

Right Ho, Jeeves,1934,Pelham Grenville Wodehouse (1881-1975)

which will make software misinterpret " Jeeves" as the second column of this row, instead of the tail end of the first column. CSV solves this problem with quotation marks:

"Right Ho, Jeeves",1934,Pelham Grenville Wodehouse (1881-1975)

But of course now you have the question of what happens if the title of your book includes quotation marks. You should read up on how CSV handles these situations.

For this and possibly future assignments, you'll be using the books1.csv file as your database. Your programs will read data from this file as needed to satisfy the requirements of the assignment. To do this, you'll use Python's csv module.
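To get a feel for how the csv module deals with quoted fields, here is a minimal sketch. It reads two of the sample rows above from a string rather than from books1.csv, and the names SAMPLE_CSV and read_rows are made up for this illustration, not part of the assignment.

```python
import csv
import io

# Two sample rows from the text above. The second title contains a comma,
# so it is wrapped in quotation marks.
SAMPLE_CSV = '''To Say Nothing of the Dog,1997,Connie Willis (1945-)
"Right Ho, Jeeves",1934,Pelham Grenville Wodehouse (1881-1975)
'''

def read_rows(csv_file):
    """Return a list of rows, where each row is a list of strings."""
    reader = csv.reader(csv_file)
    return [row for row in reader]

# io.StringIO lets us treat a string like an open file, which keeps this
# example self-contained.
rows = read_rows(io.StringIO(SAMPLE_CSV))
for title, year, author in rows:
    print(f'{title} | {year} | {author}')
```

Note that every field comes back as a string, including the publication year, and that the quotation marks around "Right Ho, Jeeves" are removed by the reader. With a real file, you would pass an open file object instead, typically opened with newline='' as the csv documentation recommends.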

Command-line arguments in Python

Writing programs that use command-line arguments to determine their behavior is an important skill. In my day-to-day life as a programmer, I write a lot of short programs (and some long ones) to do all manner of tasks for me. Sometimes in very very short programs that I plan to run exactly once, I'll hard-code input values into the program. Those programs are often like: "Open file something.txt, read its contents, do something with the contents, and print out the results". In cases like this, I'll often just put the "something.txt" right in the code (also known as hard-coding the filename).

But even when I expect to run a program only once, I generally have to run it a few times during debugging, and then I often find that it's more useful than I thought, and I end up running it on multiple different input files, sometimes sorting the output one way, other times sorting the output another way, and so on. In such cases, I always wish I had taken one or two minutes to set up a sensible command-line argument syntax for the program.

For this assignment, you are going to write a command-line tool for extracting information from the books dataset. The assignment will have three phases. First, you will design a command-line interface for the tool. Second, you'll write unit tests. Then, after revising your design based on feedback from discussion group, you will implement the resulting interface.

Task #1: Command-line design

due 11:59PM Thursday, September 23

You can easily imagine many features appropriate for a command-line tool concerned with a books-and-authors dataset. Since this project is less about the utility of the final product than about the techniques we use to create it, we're going to restrict this program to the following features:

Here's what you need to do for this task:

  1. Select one partner's cs257 git repository to work in. Make sure all partners are given push-access to the repository, and that all partners have a local clone.
  2. Create a folder called "books" at the top level of the repository. (Note the very top of this web page, where I have the notation "Folder name: books". For future assignments, you should use that "Folder name" indicator to tell you where to store your work for the assignment. This is how the grader and I will find your work.)
  3. Prepare a first draft of your program's command-line syntax, and write a short usage/help statement for the program. Put these usage statements in books/usage.txt. Add, commit, and push this so I can see it in your repository.

You can use the standard Unix manual pages as a model for how to write a command-line syntax synopsis and a usage statement. Take a look at "man mv", etc.

At the beginning of class on Friday, I will provide feedback on a handful of representative command-line designs and usage statements, after which you should revise your usage.txt before doing Task #3.

Task #2: Unit tests

due 11:59PM Monday, September 27

One purpose of this multi-task project is to give you an introduction to test-driven development (TDD). Roughly, the process goes like this:

For us, the class in question will be called BooksDataSource, and its purpose will be to provide Python programmers with convenient access to the data in our books dataset.

The trick to writing good unit test suites is to think deeply about the many ways your interfaces might be called. Your tests should, for example, test typical cases, weird cases, and illegal cases. (For a really simple example, a unit test suite for a square-root function ought to include attempts to compute the square-roots of positive integers, positive non-integers, negative numbers, and zero, and depending on the language and the completeness of the interface specification, maybe the square-root of "moose" or other non-numerical input.) You should think hard about the mistakes programmers can make, the bad data users can generate, and the ways malicious programmers might try to exploit errors or omissions in your code.
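The square-root example might look something like the following sketch, which uses Python's unittest module to test the standard math.sqrt function. The class name and the particular test values are just illustrative choices, and a real suite would likely include more cases.

```python
import math
import unittest

class SquareRootTests(unittest.TestCase):
    # Typical cases
    def test_positive_integer(self):
        self.assertEqual(math.sqrt(9), 3.0)

    def test_positive_non_integer(self):
        self.assertAlmostEqual(math.sqrt(2.25), 1.5)

    # Boundary case
    def test_zero(self):
        self.assertEqual(math.sqrt(0), 0.0)

    # Illegal cases: math.sqrt signals these by raising exceptions
    def test_negative_number(self):
        with self.assertRaises(ValueError):
            math.sqrt(-1)

    def test_non_numeric_input(self):
        with self.assertRaises(TypeError):
            math.sqrt('moose')
```

If you save this in a file, you can run the tests with "python3 -m unittest" followed by the file name. Notice that each test checks one kind of input, and that the test names say what is being tested.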

For Task #2, your jobs are:

Grading rubric for Task #2:

1 - author names appear in comment at the top of booksdatasourcetests.py
2 - tests run when combined with the unimplemented booksdatasource.py [we expect most of the tests to fail, of course]
5 - good variety of tests covering both typical and boundary cases [between 10 and 20 tests should be about right]
2 - test code is reasonably easy to read and understand
2 - test code appears to be correct [this will be hard to be certain of until we have implemented booksdatasource.py and debugged both the implementation and the tests]

Task #3: Implementation (first draft)

due 11:59PM Saturday, October 2

Time to write the program itself!

Grading rubric for Task #3:

1 - usage.txt makes sense and supports all features required by the assignment
1 - comment with author names at top of books.py
2 - student's unit tests for booksdatasource.py pass
3 - grader's unit tests for booksdatasource.py pass
1 - user can get help from the command line
3 - required command-line operations work correctly as described in usage.txt
4 - code organization quality, including quality of naming

Task #4: Code review

In 3-4 separate sessions on October 5 and 6. I'll announce the dates and locations on our Slack #announcements channel a few days ahead of time.

Here are the instructions for preparing for the code review.

Grading rubric for Task #4:

2 - you provided feedback to three other teams as specified
6 - your feedback was helpful (see instructions for suggestions on how to be helpful)

Task #5: Revision

due 11:59PM Monday, October 11

Here are the instructions for preparing a revision of books.py.

Grading rubric for Task #5:

4 - quality of responsiveness to code review feedback
6 - quality of the final code

Implementing a command-line interface

There are two main approaches to implementing a command-line syntax: handle the command-line in its raw form (i.e. the list of strings sys.argv) or use a Python module designed to make command-line parsing easier.

For extremely simple programs, using sys.argv directly can be the easiest way to go. Here's a simple example of using sys.argv to parse command-line arguments.
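To make the raw approach concrete, here is a minimal sketch, separate from the sample mentioned above. It assumes a hypothetical program called example.py that takes one search string and an optional -y flag. That syntax is invented for illustration and is not the design required for this assignment.

```python
import sys

def parse_command_line(argv):
    """Return (search_string, sort_by_year) from a raw argument list.

    argv has the same shape as sys.argv: argv[0] is the program name.
    Hypothetical syntax: python3 example.py [-y] search_string
    """
    arguments = argv[1:]          # a copy, so sys.argv itself is unchanged
    sort_by_year = False
    if '-y' in arguments:
        sort_by_year = True
        arguments.remove('-y')
    if len(arguments) != 1:
        raise ValueError('Usage: python3 example.py [-y] search_string')
    return arguments[0], sort_by_year

def main():
    try:
        search_string, sort_by_year = parse_command_line(sys.argv)
    except ValueError as error:
        print(error, file=sys.stderr)
        sys.exit(1)
    print(f'search for {search_string!r}, sort by year: {sort_by_year}')
```

To run this as a script, add a call to main() at the bottom of the file. Even with only one flag, the code has to handle the flag, the missing-argument case, and the usage message by hand, which is why this approach gets unwieldy quickly.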

For any program whose command line is going to have a little bit of complexity, it's usually better to use a module like argparse instead of using sys.argv directly. Here's a brief argparse example that you might find helpful: argparse_example.py.

There are many command-line-parsing modules for Python: argparse, getopt, docopt, optparse, click, etc. I'm suggesting argparse for this project because it comes standard with any installation of Python, and it is illustrative of the power (and sometimes the frustration) of using a module like this.
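For comparison, here is a minimal argparse sketch (not the argparse_example.py mentioned above). The program name, the search_string argument, and the -y/--sort-by-year flag are all hypothetical, chosen only to show the shape of the code.

```python
import argparse

def build_parser():
    """Build a parser for the hypothetical syntax: example.py [-y] search_string"""
    parser = argparse.ArgumentParser(
        prog='example.py',
        description='Search a list of titles (hypothetical example).')
    # A required positional argument
    parser.add_argument('search_string',
                        help='text to look for in each title')
    # An optional flag that is False unless the user supplies it
    parser.add_argument('-y', '--sort-by-year', action='store_true',
                        help='sort results by publication year')
    return parser

# parse_args normally reads sys.argv[1:]; passing a list explicitly makes
# the parser easy to experiment with.
args = build_parser().parse_args(['--sort-by-year', 'jeeves'])
print(args.search_string, args.sort_by_year)
```

In a real program you would call parse_args() with no arguments so that it reads the actual command line. argparse also adds -h/--help automatically and generates the help text from the help strings you supply.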

Constraints and suggestions

Start early, ask questions, and have fun!

(And don't forget Slack—our #questions channel is meant for you!)