CS 257 Assignment

PostgreSQL is an open-source relational database management system. It works on most platforms, and has taken over top market share position among open-source database systems. It is usually pronounced just "post-gress".

SQL, the Structured Query Language is by far the most common language for extracting information from relational databases.

This lab exercise is intended to help you get started with PostgreSQL and SQL.

Installation

Here are instructions for various situations

macOS: If you use the Homebrew package manager, these instructions take care of basic installation. You can also get a more detailed version of the same thing.
macOS: If you don't use Homebrew, you can install Postgres.app. This gives you a GUI for running the PosgreSQL server, but you'll still use the command line as your PostgreSQL client for creating and modifying your database. You'll want to follow these instructions for modifying your $PATH to enable you to use the command-line client.
Windows: This summary of installation options is good.

Notes

A postgresql installation will have some set of "users". Most installers create a user for you using whatever login name you're using on your computer. You should be able to see the list of users by executing
SELECT * FROM pg_user;
at the psql prompt (see below)
The Postgres.app program automatically creates a starter user and database for you,

Create a database

At your command line, execute:
createdb youraccountname_on_your_computer
Now you're ready to run your postgresql client, psql, at the command line:
psql youraccountname_on_your_computer

This should give you a psql prompt. Mine shows me this:
jondich=#
At the psql prompt, you can see a list of users:
SELECT * FROM pg_user;
or a list of databases:
\l
(that's a backslash-ell, not a backslash-one)

Creating tables

Launch psql with your database:
psql youraccountname_on_your_computer
At the psql prompt, create three tables for your database by running the following SQL queries/statements/commands. (You can just copy and paste each CREATE TABLE statement from this web page to your psql prompt. Make sure to include the terminal semi-colon.)
CREATE TABLE authors ( id SERIAL, last_name TEXT, first_name TEXT, birth_year INT, death_year INT, PRIMARY KEY(id) );

CREATE TABLE books ( id SERIAL, title TEXT, publication_year INT, PRIMARY KEY(id) );

CREATE TABLE books_authors ( book_id INT, author_id INT );
Take a look at your list of tables by executing the special psql directive
\dt
There are many, many psql meta-commands (scroll down to the Meta-Commands section). Feel free to explore.
Finally, view the contents of your authors table:
SELECT * FROM authors;
It's empty, right?

Loading tables from a CSV file

Get these three files: books.csv, authors.csv, and books_authors.csv.

At your psql prompt, execute

COPY books FROM '/full/path/to/books.csv' DELIMITER ',' CSV HEADER;

Then do the same thing for authors/authors.csv and books_authors/books_authors.csv

Note that the HEADER part of this command is saying that the csv file is expected to begin with a line showing the names of the columns, and the remainder of the lines will be rows of data. Check the COPY documentation to see details if you're interested. (Do it! It's good to be curious!)

Some SQL queries

Now that you have the books/authors database available, try a few SQL queries. There's a whole lot to learn about SQL, but this is a simple start.

One piece of general advice. When you try these queries, take the time to think about them. What role does each word or symbol play in the query? What variations could you try, and do those variations work as expected? Check out the formal documentation (e.g. web search for "postgresql select query"), and see how my example query fits the formal documentation. etc.

SELECT * FROM books;
SELECT title, publication_year FROM books;
This is a join of two tables. What happens for the book (Good Omens) that has two authors?
SELECT books.title, authors.last_name FROM books, authors, books_authors WHERE books.id=books_authors.book_id AND authors.id=books_authors.author_id;
What's different here compared to the previous query? Does it affect the output?
SELECT title, last_name FROM books, authors, books_authors WHERE books.id=books_authors.book_id AND authors.id=books_authors.author_id;
Add even more to the WHERE clause to help filter the results.
SELECT books.title, authors.last_name FROM books, authors, books_authors WHERE books.id=books_authors.book_id AND authors.id=books_authors.author_id AND books.publication_year > 1950;
Change an existing item. (Suppose, for example, that you hate that I wrote the first name of Mr. Wodehouse as "P.G.", and you really want it to be "Pelham Grenville" as his mother intended.
UPDATE authors SET first_name='Pelham Grenville' WHERE last_name='Wodehouse';

Adding a row, and fighting with SERIAL

Let's add an author. Given the state we have left things in up to now, we will encounter an interesting error message.

INSERT INTO authors (last_name, first_name, birth_year, death_year) VALUES ('Murakami', 'Haruki', 1949, 0);

(Note, by the way, that in this database, I decided to represent "I'm not dead yet" with death_year=0, rather than death_year=NULL. This may be bad practice, but it makes certain operations easier, which I prefer when introducing a complex tool to a bunch of students who haven't seen it before.)

Anyway, the INSERT statement above generates (for me, anyway) the following error message:

ERROR: duplicate key value violates unique constraint "authors_pkey" DETAIL: Key (id)=(1) already exists.

Interesting. So what's up? The issue is that when we did CREATE TABLE authors, we specified a SERIAL type for id. The way PostgreSQL handles this is by creating not just the authors table, but also a "sequence" item called authors_id_seq. You can view all the sequences in a database in psql with the \ds meta-command, and you can look at the contents of a sequence like this:

SELECT * FROM authors_id_seq;

If you do that, you'll see that authors_id_seq's last_value field has value 1. This means that when we tried to add Haruki Murakami to the authors table, postgresql tried to give him id=1, which it then kept as last_value for authors_id_seq even though the operation failed. Try adding Murakami again, and the sequence will go to last_value=2, and so on until we get past all the used id values.

Note that even though we did our COPY FROM...CSV command after we did CREATE TABLE authors, the authors_id_seq sequence did not get updated accordingly. This certainly happened with the books table as well.

So what to do? We want to set the last_value in authors_id_seq to be the largest author id in our authors table. How to do that? It turns out to be pretty simple:

SELECT setval('authors_id_seq', (SELECT MAX(id) FROM authors));

That is, we first determine the largest value of the id field in the authors table, and then we set the authors_id_seq last_value field to equal that maximum id. Try that, and then take a look (via suitable SELECT statements) at both the authors table contents and the authors_id_seq sequence contents. Do the values all make sense?

OK, so back to Mr. Murakami:

INSERT INTO authors (last_name, first_name, birth_year, death_year) VALUES ('Murakami', 'Haruki', 1949, 0);

No error message this time, and if we take a look at the authors table, now Murakami is right where we wanted him.

So, here's a little exercise for you. Fix the books id sequence so that it will let you add a new book. Then add Murakami's novel, "A Wild Sheep Chase", which was published in 1982.

Some handy tools

As you work to import your dataset into a database, you'll probably have false starts, you'll want to make small adjustments, etc. Here are a few tools that may help you.

Want to save your whole database in a file in case you mess something up and need to start over? Try this at the regular command line (i.e. not in psql):
pg_dump --no-owner --no-privileges name_of_your_database > whatever.sql
This will save instructions for rebuilding and reloading all your tables and sequences.
Whoops! You messed up your authors? Let's restore the whole thing from backup. First, we need to delete all the tables.
psql name_of_your_database name_of_your_database=# DROP TABLE authors, books, books_authors name_of_your_database=# \q
Then, let's do the restoration:
psql name_of_your_database < whatever.sql
where of course whatever.sql is the name of the file you dumped the database into with pg_dump.
The python3 csv module may be handy for helping you to get your data into CSV form in the first place so you can then use COPY to load your data into postgresql tables.

CS 257: Software Design

Getting started with PostgreSQL and SQL