Books: test-driven development, command-line arguments, and CSV

Folder: books

This project involves several due dates for subtasks.

Goals
Background
Phase #1: CLI design [due 9/20]
Phase #2: Unit tests [due 9/23]
Phase #3: Implementation first draft [due 9/29]
Phase #4: Code review [in class 10/3]
Phase #5: Revision [due 10/7]

Goals

Think about command-line interface design issues.
Learn how to work with command-line arguments using Python's argparse module.
Learn the basics of reading comma-separated values files using Python's csv module.
Practice using test-driven development (TDD) and Python's unittest module (also known as PyUnit).

Background

The data: books, authors, and comma-separated values

Take a look at books1.csv, a file full of data about books and authors. There are only a few dozen books in this dataset, so you wouldn't want to base an important book-related application on it. But for learning about how to manipulate datasets like this, a couple dozen books will be plenty.

Note that when you look at a CSV file on github.com, the GitHub user interface renders the file in pretty columns, but the file itself is just text and doesn't look pretty when you open it with vim or VS Code or whatever. To get your hands on files that are stored in my GitHub repo for this class, you should just clone it:

git clone https://github.com/ondich/cs257_2022_fall jeffs_repo

or something like that. You'll want to go into your clone of my repo periodically and do a "git pull" since I'll be putting samples there all term.

Back to books1.csv. The file's format is known as comma-separated values (CSV). CSV is a very simple format used to store tables of textual data. Each line of text represents a row in the table, and the fields/columns in each row are separated by commas. These few lines, in the format "title,publication year,author", illustrate the principle:

Jane Eyre,1847,Charlotte Brontë (1816-1855)
To Say Nothing of the Dog,1997,Connie Willis (1945-)
The Stone Sky,2017,N.K. Jemisin (1972-)

The only thing that makes CSV at all tricky is when the data in one of the table cells contains either a comma or a newline character. For example, consider the novel "Right Ho, Jeeves" by P.G. Wodehouse. If you just comma-separate the fields, you get this:

Right Ho, Jeeves,1934,Pelham Grenville Wodehouse (1881-1975)

which will make software misinterpret " Jeeves" as the second column of this row, instead of the tail end of the first column. CSV solves this problem with quotation marks:

"Right Ho, Jeeves",1934,Pelham Grenville Wodehouse (1881-1975)

But of course now you have the question of what happens if the title of your book includes quotation marks. You should read up on how CSV handles these situations.

For this assignment, you'll be using the books1.csv file as your database. Your programs will read data from this file as needed to satisfy the requirements of the assignment. To do this, you'll use Python's csv module.
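As a minimal sketch of what the csv module does with the quoted-title case above, the following parses two sample rows. The io.StringIO object is only a stand-in so the sketch runs without books1.csv on disk; with the real file you would pass an open file object instead.

```python
import csv
import io

# Two sample rows in the "title,publication year,author" format.
# The second title contains a comma, so it is wrapped in quotation marks.
sample_text = (
    'Jane Eyre,1847,Charlotte Brontë (1816-1855)\n'
    '"Right Ho, Jeeves",1934,Pelham Grenville Wodehouse (1881-1975)\n'
)

# io.StringIO stands in for a file here. With a real file, something like
#     open('books1.csv', newline='', encoding='utf-8')
# would take its place.
rows = list(csv.reader(io.StringIO(sample_text)))

for title, year, author in rows:
    # Every field arrives as a string, including the year.
    print(f'{title} | {year} | {author}')
```

Note that csv.reader strips the surrounding quotation marks and keeps "Right Ho, Jeeves" together as a single field, so each row still has exactly three columns.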

Command-line arguments in Python

Writing programs that use command-line arguments to determine their behavior is an important skill. In my day-to-day life as a programmer, I write a lot of short programs (and some long ones) to do all manner of tasks for me. Sometimes in very very short programs that I plan to run exactly once, I'll hard-code input values into the program. Those programs are often like: "Open file something.txt, read its contents, do something with the contents, and print out the results". In cases like this, I'll often just put the "something.txt" right in the code (also known as hard-coding the filename).

But even when I expect to run a program only once, I generally have to run it a few times during debugging, and then I often find that it's more useful than I thought, and I end up running it on multiple different input files, sometimes sorting the output one way, other times sorting the output another way, and so on. In such cases, I always wish I had taken one or two minutes to set up a sensible command-line argument syntax for the program.

For this assignment, you are going to write a command-line tool for extracting information from the books dataset. The assignment will proceed in several phases. First, you will design a command-line interface for the tool. Second, you'll write unit tests for the class that provides access to the data. Then, after revising your design based on feedback from discussion group, you will implement the resulting interface.

Phase #1: Command-line design

Due 9/20 (I forgot to specify a git tag until late, so no tag required for this one)

You can easily imagine many features appropriate for a command-line tool concerned with a books-and-authors dataset. Since this project is less about the utility of the final product than about the techniques we use to create it, we're going to restrict this program to the following features:

Given a search string S, print a list of books whose titles contain S (case-insensitive). Books may be sorted by title or by publication year.
Given a search string S, print a list of authors whose names contain S (case-insensitive). For each such author, print a list of the author's books. Authors should be printed in alphabetical order by surname, breaking ties by using given name (e.g. Anne Brontë comes before Charlotte Brontë).
Given a range of years A to B, print a list of books published between years A and B, inclusive. Books should be printed in order of publication year.
If the user requests a usage statement (via a suitable command-line flag) or if the user's command-line syntax is invalid, print a suitable usage statement.
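The search and sorting rules in the features above can be sketched on plain tuples. Everything here is an illustrative assumption: the sample data, the tuple layouts, and the function names are hypothetical and are not part of the BooksDataSource interface you'll be given.

```python
# Hypothetical sample data: (title, publication_year) and (given_name, surname).
sample_books = [
    ('To Say Nothing of the Dog', 1997),
    ('The Tenant of Wildfell Hall', 1848),
    ('Jane Eyre', 1847),
]
sample_authors = [('Charlotte', 'Brontë'), ('Connie', 'Willis'), ('Anne', 'Brontë')]

def titles_containing(search_text, books):
    """Case-insensitive substring match on titles, sorted by title."""
    needle = search_text.lower()
    matches = [book for book in books if needle in book[0].lower()]
    return sorted(matches, key=lambda book: book[0].lower())

def sorted_by_surname(authors):
    """Sort by surname, breaking ties by given name."""
    return sorted(authors, key=lambda author: (author[1].lower(), author[0].lower()))

def books_between(start_year, end_year, books):
    """Books published from start_year to end_year inclusive, sorted by year."""
    matches = [book for book in books if start_year <= book[1] <= end_year]
    return sorted(matches, key=lambda book: book[1])

print(titles_containing('THE', sample_books))
print(sorted_by_surname(sample_authors))
print(books_between(1847, 1848, sample_books))
```

Using a tuple as the sort key is what makes the tie-breaking work: Python compares the surnames first and only looks at the given names when the surnames are equal.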

Here's what you need to do for this task:

Select one partner's cs257 git repository to work in. Make sure all partners are given push-access to the repository, and that all partners have a local clone.
Create a folder called "books" at the top level of the repository. (Note the very top of this web page, where I have the notation "Folder: books". For future assignments, you should use that "Folder" indicator to tell you where to store your work for the assignment. This is how the grader and I will find your work.)
Prepare a first draft of your program's command-line syntax, and write a short usage/help statement for the program. Put these usage statements in books/usage.txt. Add, commit, and push this so I can see it in your repository.

You can use the standard Unix manual pages as a model for how to write a command-line syntax synopsis and a usage statement. Take a look at "man mv", etc.

At the beginning of class on Friday, I will provide feedback on a handful of representative command-line designs and usage statements, after which you should revise your usage.txt before doing Phase #3.

Phase #2: Unit tests

Due 9/23, git tag books-tests

One purpose of this multi-task project is to give you an introduction to test-driven development (TDD). Roughly, the process goes like this:

You start with a class whose interfaces have been written and agreed upon. (For our purposes, an interface will refer to a method signature plus the descriptive comment that goes with it.)
You write a collection of unit tests to thoroughly test the agreed-upon interfaces. You do this before implementing the interfaces.
You implement the interfaces, using the unit tests to help you debug and to give you a way to determine whether you're done with the implementation.

For us, the class in question will be called BooksDataSource, and its purpose will be to provide Python programmers with convenient access to the data in our books dataset.

The trick to writing good unit test suites is to think deeply about the many ways your interfaces might be called. Your tests should, for example, test typical cases, weird cases, and illegal cases. (For a really simple example, a unit test suite for a square-root function ought to include attempts to compute the square-roots of positive integers, positive non-integers, negative numbers, and zero, and depending on the language and the completeness of the interface specification, maybe the square-root of "moose" or other non-numerical input.) You should think hard about the mistakes programmers can make, the bad data users can generate, and the ways malicious programmers might try to exploit errors or omissions in your code.
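Here is a rough sketch of what the square-root test suite described above might look like with unittest. The square_root function is a hypothetical stand-in, included only so the tests have something to run against; in real TDD the tests would be written before it existed. The choice of exceptions for negative and non-numeric input is also an assumption about what the interface specification would say.

```python
import math
import unittest

def square_root(x):
    """Hypothetical function under test: the non-negative square root of x."""
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        raise TypeError('square_root requires a number')
    if x < 0:
        raise ValueError('square_root requires a non-negative number')
    return math.sqrt(x)

class SquareRootTester(unittest.TestCase):
    def test_positive_integer(self):
        self.assertEqual(square_root(9), 3.0)

    def test_positive_non_integer(self):
        self.assertAlmostEqual(square_root(2.25), 1.5)

    def test_zero(self):
        self.assertEqual(square_root(0), 0.0)

    def test_negative_number(self):
        self.assertRaises(ValueError, square_root, -4)

    def test_non_numeric_input(self):
        self.assertRaises(TypeError, square_root, 'moose')

# Load and run the tests explicitly. In a stand-alone test file you would
# more commonly end with: if __name__ == '__main__': unittest.main()
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SquareRootTester)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Each test method checks one kind of input, so a failure report tells you immediately which category of case is broken.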

For Phase #2, your jobs are:

If you haven't completed the unit tests lab yet, do so first.
Take a look at the books1.csv data file. This is an example file illustrating the expected CSV format for this project.
Read this interface for a BooksDataSource class carefully. Think about what features this specification supports and does not support, and collect your questions about the interface's design and its Python details. You will not be allowed to change this interface, so get to know it.
Save a copy of booksdatasource.py in the books directory you created during Phase #1. Leave it untouched for the remainder of this Phase.
Copy books/booksdatasourcetests.py from my repository, which will give you a class called BooksDataSourceTest inheriting from unittest.TestCase. You may also look at primecheckertests.py to get some inspiration.
Implement a thorough collection of unit tests for the non-constructor methods in BooksDataSource. The goal of these tests is to provide as wide a range of unit tests as you can think of. Don't repeat yourself (you probably wouldn't need to test both square_root(3.0) and square_root(5.0) if you were testing square_root), but also don't be shy about writing lots of tests. Effectively probing the potential vulnerabilities of an interface usually takes lots of little tests.
Your tests may involve using a variety of small CSV files of your own creation, since, for example, it would be easier to test the sorting of a list of three books than a list of 40 books. Save your data files, if any, in the books directory along with your usage.txt and *.py files. Name your test data files something consistent and sensible. If you do this, you may find it useful to instantiate your BooksDataSource object inside the test methods instead of inside setUp.
Make sure that the grader and I can run your tests by doing the following:
cd your-repo/books
python3 booksdatasourcetests.py
Note that this should just run and produce a test report. (We expect that almost all of your tests will fail at this point. That's normal, since you haven't implemented the BooksDataSource methods yet. The tests will run, but some will generate "ERROR" messages.)
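The idea of testing against a tiny CSV file, with the data source instantiated inside the test method, might be sketched like this. TinyDataSource and its titles_by_year method are hypothetical stand-ins and are NOT the BooksDataSource interface. The sketch also writes its data to a temporary file only to stay self-contained, whereas for the assignment your small data files should be saved in your repository.

```python
import csv
import os
import tempfile
import unittest

class TinyDataSource:
    """Hypothetical stand-in, not the assignment's BooksDataSource."""
    def __init__(self, csv_file_name):
        with open(csv_file_name, newline='', encoding='utf-8') as csv_file:
            self.rows = list(csv.reader(csv_file))

    def titles_by_year(self):
        """Titles ordered by publication year (column 1 of each row)."""
        return [row[0] for row in sorted(self.rows, key=lambda row: int(row[1]))]

class TinyDataSourceTester(unittest.TestCase):
    def write_csv(self, text):
        """Write text to a temporary CSV file and return the file's path."""
        handle, path = tempfile.mkstemp(suffix='.csv')
        with os.fdopen(handle, 'w', newline='', encoding='utf-8') as csv_file:
            csv_file.write(text)
        self.addCleanup(os.remove, path)  # delete the file after the test
        return path

    def test_three_books_sorted_by_year(self):
        path = self.write_csv(
            'The Stone Sky,2017,N.K. Jemisin (1972-)\n'
            'Jane Eyre,1847,Charlotte Brontë (1816-1855)\n'
            'To Say Nothing of the Dog,1997,Connie Willis (1945-)\n'
        )
        # Instantiated inside the test so each test can pick its own data file.
        data_source = TinyDataSource(path)
        self.assertEqual(
            data_source.titles_by_year(),
            ['Jane Eyre', 'To Say Nothing of the Dog', 'The Stone Sky'],
        )

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TinyDataSourceTester)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

With only three rows, the expected order is easy to verify by eye, which is the point of using small data files.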

Grading rubric for Phase #2:

1 - author names appear in comment at the top of booksdatasourcetests.py
2 - tests run when combined with the unimplemented booksdatasource.py [we expect most of the tests to fail, of course]
5 - good variety of tests covering both typical and boundary cases [between 10 and 20 tests should be about right]
2 - test code is reasonably easy to read and understand
2 - test code appears to be correct [this will be hard to be certain of until we have implemented booksdatasource.py and debugged both the implementation and the tests]

Phase #3: Implementation (first draft)

Due 9/29, git tag books-implementation

Time to write the program itself!

Implement the methods in BooksDataSource. Since booksdatasource.py is already in your repository, you're just making changes to that file.
Make sure your tests in booksdatasourcetests.py all pass. (By the way, you may find that you have to debug the tests themselves, which may feel weird. This is normal, since it's not easy to debug tests when the thing being tested doesn't exist yet.)
Implement your command-line interface in a new Python program books.py. This program will import booksdatasource.
Make sure that your books.py program and your usage.txt are consistent.

Grading rubric for Phase #3:

1 - usage.txt makes sense and supports all features required by the assignment
1 - comment with author names at top of books.py
2 - student's unit tests for booksdatasource.py pass
3 - grader's unit tests for booksdatasource.py pass
1 - user can get help from the command line
3 - required command-line operations work correctly as described in usage.txt
4 - code organization quality, including quality of naming

Phase #4: Code review

In class 10/3, no git tag

Zoom (see my office hours page for the link)
Keep an eye on the Slack #announcement channel for up-to-date details

Here are the instructions for preparing for the code review.

Note in particular: after the code review, send via Slack direct message each of your writeups to Jeff and to the authors of the code you're reviewing.

Grading rubric for Phase #4:

2 - you provided feedback to two other teams as specified
6 - your feedback was helpful (see instructions for suggestions on how to be helpful)

Phase #5: Revision

Due 10/7, git tag books-revision

Here are the instructions for preparing a revision of books.py.

Grading rubric for Phase #5:

4 - quality of responsiveness to code review feedback
6 - quality of the final code

Implementing a command-line interface

There are two main approaches to implementing a command-line syntax: handle the command-line in its raw form (i.e. the list of strings sys.argv) or use a Python module designed to make command-line parsing easier.

For extremely simple programs, using sys.argv directly can be the easiest way to go. Here's a simple example of using sys.argv to parse command-line arguments.
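A minimal sketch of the raw sys.argv approach might look like the following. The command-line syntax (an input file followed by an optional sort key) and the function names are assumptions made up for illustration; they are not the syntax required for books.py.

```python
import sys

def parse_command_line(argv):
    """Return (input_file_name, sort_key), or None if the arguments are invalid.

    argv has the same shape as sys.argv: argv[0] is the program name and
    the remaining items are the arguments the user typed.
    """
    arguments = argv[1:]
    if len(arguments) == 1:
        return arguments[0], 'title'        # sort key defaults to title
    if len(arguments) == 2 and arguments[1] in ('title', 'year'):
        return arguments[0], arguments[1]
    return None

def main(argv):
    parsed = parse_command_line(argv)
    if parsed is None:
        print(f'Usage: python3 {argv[0]} input-file [title|year]', file=sys.stderr)
        return 1
    input_file_name, sort_key = parsed
    print(f'Would read {input_file_name} and sort by {sort_key}')
    return 0

# Demonstration with an explicit argument list. In a real script, the last
# line would instead be: sys.exit(main(sys.argv))
main(['argv_sketch.py', 'books1.csv', 'year'])
```

Even for this tiny syntax, the validation logic is all hand-written, which is why this approach gets unwieldy as soon as you add a few options.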

For any program whose command line is going to have a little bit of complexity, it's usually better to use a module like argparse instead of using sys.argv directly. Here's a brief argparse example that you might find helpful: argparse_example.py.

There are many command-line-parsing modules for Python: argparse, getopt, docopt, optparse, click, etc. I'm suggesting argparse for this project because it comes standard with any installation of Python, and it is illustrative of the power (and sometimes the frustration) of using a module like this.
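For comparison, here is a short argparse sketch. The program name sorter.py, the positional argument, and the --sort option are all hypothetical choices made for illustration, not a recommended design for this project.

```python
import argparse

def build_parser():
    """Build a parser for a hypothetical program with one file and one option."""
    parser = argparse.ArgumentParser(
        prog='sorter.py',
        description='Prints the rows of a CSV file in sorted order.',
    )
    parser.add_argument('input_file', help='the CSV file to read')
    parser.add_argument(
        '--sort', '-s',
        choices=['title', 'year'],
        default='title',
        help='the field to sort by (default: title)',
    )
    return parser

# Parse an explicit list so the sketch behaves the same however it is run.
# Calling parse_args() with no argument would read sys.argv[1:] instead.
arguments = build_parser().parse_args(['books1.csv', '--sort', 'year'])
print(arguments.input_file, arguments.sort)
```

argparse adds a -h/--help flag automatically and builds the help text from the help strings. When the user's arguments are invalid, parse_args prints a usage message and exits the program, so you don't have to write that error handling yourself.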

Constraints and suggestions

In your usage.txt, when you're writing your command-line syntax synopsis, go ahead and include the "python3" part of what the user would have to type to execute the program. Like so:
python3 books.py some-operation [options]
You may have just one such line in your synopsis, or you might have three (one for each of the features described above). Use the SYNOPSIS section of man-pages for various Unix commands as a rough guide.
You may choose how you want your output to look. Do you want to include the author's name and publication date when you print a book? That's up to you. When you print an author with the author's books, do you indent the books below the author? Do you print a blank line between authors? Again, that's up to you. But please try to make the output easy to read.
You may, but need not, add options to indicate how your program's output will be sorted, displayed, etc.
The official Python documentation for the csv module includes some good, simple example code.
The official Python documentation for the argparse module.
It's also good to search the internet for things like "python csv examples", but be careful to pay attention to cues about the credibility of whatever websites you land on.
Don't try to make this program more complicated than necessary. If you're inclined to keep working once you have the program functioning, use your extra energy to make your program as simple and easy to read as possible instead of adding new features.

Start early, ask questions, and have fun!

(And don't forget Slack—our #questions channel is meant for you!)