CS 257: Software Design

Getting started with PostgreSQL and SQL

PostgreSQL is an open-source relational database management system. It works on most platforms, and has taken over top market share position among open-source database systems. It is usually pronounced just "post-gress".

SQL, the Structured Query Language is by far the most common language for extracting information from relational databases.

This lab exercise is intended to help you get started with PostgreSQL and SQL.

Installation

Here are instructions for various situations

Notes

Create a database

Creating tables

Loading tables from a CSV file

Get these three files: books.csv, authors.csv, and books_authors.csv.

At your psql prompt, execute

COPY books FROM '/full/path/to/books.csv' DELIMITER ',' CSV HEADER;

Then do the same thing for authors/authors.csv and books_authors/books_authors.csv

Note that the HEADER part of this command is saying that the csv file is expected to begin with a line showing the names of the columns, and the remainder of the lines will be rows of data. Check the COPY documentation to see details if you're interested. (Do it! It's good to be curious!)

Some SQL queries

Now that you have the books/authors database available, try a few SQL queries. There's a whole lot to learn about SQL, but this is a simple start.

One piece of general advice. When you try these queries, take the time to think about them. What role does each word or symbol play in the query? What variations could you try, and do those variations work as expected? Check out the formal documentation (e.g. web search for "postgresql select query"), and see how my example query fits the formal documentation. etc.

Adding a row, and fighting with SERIAL

Let's add an author. Given the state we have left things in up to now, we will encounter an interesting error message.

INSERT INTO authors (last_name, first_name, birth_year, death_year) VALUES ('Murakami', 'Haruki', 1949, 0);

(Note, by the way, that in this database, I decided to represent "I'm not dead yet" with death_year=0, rather than death_year=NULL. This may be bad practice, but it makes certain operations easier, which I prefer when introducing a complex tool to a bunch of students who haven't seen it before.)

Anyway, the INSERT statement above generates (for me, anyway) the following error message:

ERROR: duplicate key value violates unique constraint "authors_pkey" DETAIL: Key (id)=(1) already exists.

Interesting. So what's up? The issue is that when we did CREATE TABLE authors, we specified a SERIAL type for id. The way PostgreSQL handles this is by creating not just the authors table, but also a "sequence" item called authors_id_seq. You can view all the sequences in a database in psql with the \ds meta-command, and you can look at the contents of a sequence like this:

SELECT * FROM authors_id_seq;

If you do that, you'll see that authors_id_seq's last_value field has value 1. This means that when we tried to add Haruki Murakami to the authors table, postgresql tried to give him id=1, which it then kept as last_value for authors_id_seq even though the operation failed. Try adding Murakami again, and the sequence will go to last_value=2, and so on until we get past all the used id values.

Note that even though we did our COPY FROM...CSV command after we did CREATE TABLE authors, the authors_id_seq sequence did not get updated accordingly. This certainly happened with the books table as well.

So what to do? We want to set the last_value in authors_id_seq to be the largest author id in our authors table. How to do that? It turns out to be pretty simple:

SELECT setval('authors_id_seq', (SELECT MAX(id) FROM authors));

That is, we first determine the largest value of the id field in the authors table, and then we set the authors_id_seq last_value field to equal that maximum id. Try that, and then take a look (via suitable SELECT statements) at both the authors table contents and the authors_id_seq sequence contents. Do the values all make sense?

OK, so back to Mr. Murakami:

INSERT INTO authors (last_name, first_name, birth_year, death_year) VALUES ('Murakami', 'Haruki', 1949, 0);

No error message this time, and if we take a look at the authors table, now Murakami is right where we wanted him.

So, here's a little exercise for you. Fix the books id sequence so that it will let you add a new book. Then add Murakami's novel, "A Wild Sheep Chase", which was published in 1982.

Some handy tools

As you work to import your dataset into a database, you'll probably have false starts, you'll want to make small adjustments, etc. Here are a few tools that may help you.