PostgreSQL is an open-source
relational database management system.
It works on most platforms, and has taken over
top market share position
among open-source database systems. It is usually pronounced just "post-gress".
SQL, the Structured Query Language is by far
the most common language for extracting information from relational databases.
This lab exercise is intended to help you get started with PostgreSQL and SQL.
Installation
Here are instructions for various situations
Notes
- A postgresql installation will have some set of "users". Most
installers create a user for you using whatever login name you're
using on your computer. You should be able to see the list of
users by executing
SELECT * FROM pg_user;
at the psql prompt (see below)
- The Postgres.app program automatically creates a starter
user and database for you,
Create a database
- At your command line, execute:
createdb youraccountname_on_your_computer
- Now you're ready to run your postgresql client, psql, at
the command line:
psql youraccountname_on_your_computer
This should give you a psql prompt. Mine shows me this:
jondich=#
- At the psql prompt, you can see a list of users:
SELECT * FROM pg_user;
or a list of databases:
\l
(that's a backslash-ell, not a backslash-one)
Creating tables
- Launch psql with your database:
psql youraccountname_on_your_computer
At the psql prompt, create three tables for your database by running
the following SQL queries/statements/commands. (You can just copy and paste
each CREATE TABLE statement from this web page to your psql prompt. Make
sure to include the terminal semi-colon.)
CREATE TABLE authors (
id SERIAL,
last_name TEXT,
first_name TEXT,
birth_year INT,
death_year INT,
PRIMARY KEY(id)
);
CREATE TABLE books (
id SERIAL,
title TEXT,
publication_year INT,
PRIMARY KEY(id)
);
CREATE TABLE books_authors (
book_id INT,
author_id INT
);
- Take a look at your list of tables by executing the special
psql directive
\dt
There are many, many psql meta-commands (scroll down to the Meta-Commands section). Feel
free to explore.
- Finally, view the contents of your authors table:
SELECT * FROM authors;
It's empty, right?
Loading tables from a CSV file
Get these three files: books.csv,
authors.csv, and
books_authors.csv.
At your psql prompt, execute
COPY books FROM '/full/path/to/books.csv' DELIMITER ',' CSV HEADER;
Then do the same thing for authors/authors.csv and books_authors/books_authors.csv
Note that the HEADER part of this command
is saying that the csv file is expected to begin with a line showing
the names of the columns, and the remainder of the lines will be rows of
data. Check the COPY documentation to see
details if you're interested. (Do it! It's good to be curious!)
Some SQL queries
Now that you have the books/authors database available, try a few
SQL queries. There's a whole lot to learn about SQL, but this is
a simple start.
One piece of general advice. When you try these queries, take
the time to think about them. What role does each word or symbol
play in the query? What variations could you try, and do those
variations work as expected? Check out the formal documentation
(e.g. web search for "postgresql select query"), and see how my
example query fits the formal documentation. etc.
- SELECT * FROM books;
- SELECT title, publication_year FROM books;
- This is a join
of two tables. What happens for the book
(Good Omens)
that has two authors?
SELECT books.title, authors.last_name
FROM books, authors, books_authors
WHERE books.id=books_authors.book_id
AND authors.id=books_authors.author_id;
- What's different here compared to the previous query? Does it affect the output?
SELECT title, last_name
FROM books, authors, books_authors
WHERE books.id=books_authors.book_id
AND authors.id=books_authors.author_id;
- Add even more to the WHERE clause to help filter the results.
SELECT books.title, authors.last_name
FROM books, authors, books_authors
WHERE books.id=books_authors.book_id
AND authors.id=books_authors.author_id
AND books.publication_year > 1950;
- Change an existing item. (Suppose, for example, that
you hate that I wrote the first name of Mr. Wodehouse as
"P.G.", and you really want it to be "Pelham Grenville" as
his mother intended.
UPDATE authors
SET first_name='Pelham Grenville'
WHERE last_name='Wodehouse';
Adding a row, and fighting with SERIAL
Let's add an author. Given the state we have left things in
up to now, we will encounter an interesting error message.
INSERT INTO authors
(last_name, first_name, birth_year, death_year)
VALUES ('Murakami', 'Haruki', 1949, 0);
(Note, by the way, that in this database, I decided to represent
"I'm not dead yet"
with death_year=0, rather than death_year=NULL. This may be bad
practice, but it makes certain operations easier, which I prefer
when introducing a complex tool to a bunch of students who haven't
seen it before.)
Anyway, the INSERT statement above
generates (for me, anyway) the following error message:
ERROR: duplicate key value violates unique constraint "authors_pkey"
DETAIL: Key (id)=(1) already exists.
Interesting. So what's up? The issue is that when we did
CREATE TABLE authors, we specified
a SERIAL type for id.
The way PostgreSQL handles this is by creating not just the
authors table, but also a "sequence" item called
authors_id_seq. You can view all the sequences in a database
in psql with the \ds meta-command,
and you can look at the contents of a sequence like this:
SELECT * FROM authors_id_seq;
If you do that, you'll see that authors_id_seq's last_value
field has value 1. This means that when we tried to add
Haruki Murakami to the authors table, postgresql tried to
give him id=1, which it then kept as last_value for authors_id_seq
even though the operation failed. Try adding Murakami again,
and the sequence will go to last_value=2, and so on until we
get past all the used id values.
Note that even though we did our COPY FROM...CSV
command after we did CREATE TABLE authors,
the authors_id_seq sequence did not get updated accordingly.
This certainly happened with the books table as well.
So what to do? We want to set the last_value in authors_id_seq to
be the largest author id in our authors table. How to do that? It turns
out to be pretty simple:
SELECT setval('authors_id_seq', (SELECT MAX(id) FROM authors));
That is, we first determine the largest value of the id
field in the authors table, and then we set the authors_id_seq last_value
field to equal that maximum id. Try that, and then take a look
(via suitable SELECT statements) at both the authors table contents
and the authors_id_seq sequence contents. Do the values all make sense?
OK, so back to Mr. Murakami:
INSERT INTO authors
(last_name, first_name, birth_year, death_year)
VALUES ('Murakami', 'Haruki', 1949, 0);
No error message this time, and if we take a look at the authors
table, now Murakami is right where we wanted him.
So, here's a little exercise for you. Fix the books id sequence so that
it will let you add a new book. Then add Murakami's novel,
"A Wild Sheep Chase", which was published in 1982.
Some handy tools
As you work to import your dataset into a database, you'll
probably have false starts, you'll want to make small adjustments,
etc. Here are a few tools that may help you.
- Want to save your whole database in a file in case
you mess something up and need to start over? Try this
at the regular command line (i.e. not in psql):
pg_dump --no-owner --no-privileges name_of_your_database > whatever.sql
This will save instructions for rebuilding and reloading
all your tables and sequences.
- Whoops! You messed up your authors? Let's restore the
whole thing from backup. First, we need to delete all the
tables.
psql name_of_your_database
name_of_your_database=# DROP TABLE authors, books, books_authors
name_of_your_database=# \q
Then, let's do the restoration:
psql name_of_your_database < whatever.sql
where of course whatever.sql is the name of the file
you dumped the database into with pg_dump.
- The python3 csv module
may be handy for helping you to get your data into CSV form
in the first place so you can then use COPY
to load your data into postgresql tables.