CS 257: Software Design

PostgreSQL lab

Nothing to hand in

This term, we have developed a little code based on a small dataset of books and authors. So far, our dataset has been stored using CSV files. Though CSV is a pretty common format for transferring data from person to person, it has a lot of weaknesses. So this week, we're going to start using a more powerful tool for storing and querying our data: a relational database.

This lab exercise will introduce you to the database management system PostgreSQL (usually called just "postgres"), and to the Structured Query Language (SQL).

1. Install postgres

First up, you need to install postgres on your work machine. You will use the package manager you learned about in the setup lab during Week 1. Here are some blogs that might be helpful:

I'm going to leave this part of this lab up to you, but you might find the two videos I posted on the main course page helpful. As I've mentioned before, one of the skills I want you to take away from this class is an increased independence in setting up and maintaining your development environment. That said, I strongly urge you to share your questions and your discoveries with each other on Slack. In particular, if you find a great tutorial on getting started with postgres, please post a link!

You will be ready to go with postgres once you have:

To get there, you'll need to do a few things in addition to the installation.

NOTE 1: I want you to be able to do all of this via your Unix command line. There are GUI versions of the postgres tools, and you can use them in the future, but I want you to learn the core concepts and learn how to navigate postgres using the command line first.

NOTE 2: Really, ask each other for help on Slack. I'll be there, too, of course.

2. psql basics

For the remainder of this lab, I will assume that you are typing commands (both psql directives and SQL queries) at the psql prompt (books=>) not at the Unix prompt. Watch out for the difference!

psql is a tool that allows you to create and manipulate databases. You'll be typing two kinds of commands at the psql prompt:

Try \?, \l, and \q, and see what they give you. (Of course, after \q, you'll have to launch psql again.

3. Tables, data, and simple SQL queries

Our books/authors database will consist of three tables. The first table will be called authors. Each row of the authors table will represent a single author, much in the same way each instance of a Java class called Author would represent a single author.

  1. Let's create the authors table. Type or paste the following CREATE TABLE statement at the psql prompt, and hit return. Note that you can type this one line at a time, and the statement will only get executed once you have typed the semi-colon followed by a return.

    CREATE TABLE authors ( id SERIAL, surname text, given_name text, birth_year integer, death_year integer );

    Notice a couple things about this CREATE TABLE statement.

    • CREATE TABLE etc. is a statement in the Structured Query Language, which is virtually always abbreviated SQL, and pronounced either "ESS-CUE-ELL" or "Sequel". As noted above, SQL statements and queries are distinct from the psql directives like \d and \q, but you can type either SQL or psql directives at the psql prompt.
    • The type SERIAL is an integer that postgres will automatically initialize and increment for you. So every time you add a new author to the table, it will get the next integer id available.
    • When you executed the CREATE TABLE statement, you should have seen this:
      NOTICE: CREATE TABLE will create implicit sequence "authors_id_seq" for serial column "authors.id" CREATE TABLE
      This is psql's way of telling you that it implemented SERIAL by creating a table-like object known as a "sequence" to keep track of the next available id value.
    • A table has "fields" or "columns". These play roughly the same role as instance variables in classes.
    • SQL has the notion of a NULL value, which indicates "this field has no value at all". Note that NULL in this context is not the same as zero. For the books/authors example, we're going to use NULL in the death_year field to indicate that the author is still alive.
  2. Type "\d" at the psql prompt. You should see information about both the authors table and the corresponding id sequence.

    Type "\d authors" to see the structure of the authors table.

    Type the SQL query:

    SELECT * FROM authors;
    to see that you don't have any authors in the table yet.

  3. Pick a favorite book written by somebody other than the authors in my data. I'll use "Pattern Recognition" (2003) by William Gibson as my example. Let's add Gibson to our authors table:
    INSERT INTO authors (surname, given_name, birth_year, death_year) VALUES ('Gibson', 'William', 1948, NULL);
    Now try the SELECT query from above again. Do you see your author in the table? Here's what I see:
    id | surname | given_name | birth_year | death_year ----+---------+------------+------------+------------ 1 | Gibson | William | 1948 | (1 row)
  4. Go through the last few steps again to create the books table and a book or two by your author:
    CREATE TABLE books ( id SERIAL, title text, publication_year integer );
    followed by some suitable INSERT statements, one for each book.
  5. At this point, we have an authors table, a books table, one author, and one book. But we have not added any information to the database to indicate that our author wrote our book. Here's how we do that:
    CREATE TABLE books_authors ( book_id integer, author_id integer );
    and
    INSERT INTO books_authors (book_id, author_id) VALUES (1, 1);
    Note that this is a pretty trivial case. We have one author (William Gibson, id=1) and one book (Pattern Recognition, id=1), so adding (1, 1) to the books_authors table just says that author #1 wrote book #1. As we add more authors and more of their books, of course, the (book_id, author_id) pairs will be more interesting. We might see, for example, rows in the books_authors table like (3, 5) and (17, 5), which would say that author #5 wrote both book #3 and book #17. Or (3, 5) and (3, 12), which would say that book #3 was co-written by author #5 and author #12.
  6. Got it? An authors table giving information about each author, a books table giving information about each book, and a books_authors table showing the authorship relationship between books and authors.

    Now let's BURN IT ALL DOWN and start over:

    DROP TABLE authors; DROP TABLE books; DROP TABLE books_authors;

4. Loading and querying our books and authors

  1. Pull the latest changes to my GitHub repository, and copy the files books/books2.csv, books/authors2.csv, and books/books_authors2.csv, into your working directory.
  2. Go back and redo the three CREATE TABLE statements from earlier in the lab to create three empty tables.
  3. Type:
    \copy authors FROM 'authors2.csv' DELIMITER ',' CSV NULL AS 'NULL'
  4. Try this SQL query:
    SELECT * FROM authors;
    Do you see the list of authors? (I hope so!)
  5. Repeat the \copy and SELECT steps for books2.csv and books_authors2.csv.
  6. Once you're confident that all three tables have the right data in them, we can do something cool. We can combine these tables to search the data in lots of interesting ways. SQL has enormous power and a ton of features, and we'll barely scratch the surface in this class. But here are a few example queries that you might find interesting. Try to guess what they do before pasting the into psql. Then see if you were right.

    SELECT authors.given_name, authors.surname, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id;
    SELECT books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id AND authors.surname = 'Willis';
    SELECT authors.given_name, authors.surname, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id AND authors.surname LIKE 'B%' ORDER BY authors.surname, authors.given_name;
    SELECT authors.given_name, authors.surname, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id ORDER BY authors.surname;

5. All done!

Now you're ready to do some cool stuff with a database. We're gonna have some fun!