This term, we have developed a little code based on a small
dataset of books and authors. So far, our dataset has been stored using CSV files.
Though CSV is a pretty common format for transferring data from person to person,
it has a lot of weaknesses. So this week, we're going to start using a more
powerful tool for storing and querying our data: a
relational database.
This lab exercise will introduce you to the database management system
PostgreSQL (usually called just "postgres"),
and to the
Structured Query Language (SQL).
1. Install postgres
First up, you need to install postgres on your work machine. You will use the
package manager you learned about in the setup lab during Week 1. Here are some blogs
that might be helpful:
I'm going to leave this part of this lab up to you, but you might find the two videos
I posted on the main course page helpful. As I've mentioned before,
one of the skills I want you to take away from this class is an increased independence in
setting up and maintaining your development environment. That said, I strongly urge
you to share your questions and your discoveries with each other on Slack. In particular,
if you find a great tutorial on getting started with postgres, please post a link!
You will be ready to go with postgres once you have:
To get there, you'll need to do a few things in addition to the installation.
- Learn how to launch and shut down the postgres server running on your computer
(this may vary by OS)
- Learn how to launch and quit from the "psql" command-line interface
- Create a new postgres user for yourself (hint: keep your eyes peeled in
whatever tutorial you use for a line like: CREATE ROLE newuser WITH LOGIN PASSWORD 'password';)
- Ensure that your postgres user name has permission to create new databases
(hint: ALTER ROLE newuser CREATEDB;)
- Create a new database called "books", owned or manageable by your new user name
NOTE 1: I want you to be able to do all of this via your Unix command line.
There are GUI versions of the postgres tools, and you can use them in the
future, but I want you to learn the core concepts and learn how to navigate
postgres using the command line first.
NOTE 2: Really, ask each other for help on Slack. I'll be there, too, of course.
2. psql basics
For the remainder of this lab, I will assume that you are typing commands
(both psql directives and
SQL queries) at the psql prompt (books=>) not at the Unix prompt. Watch out for the difference!
psql is a tool that allows you to create and manipulate databases. You'll be typing two kinds of
commands at the psql prompt:
- psql directives, which start with a backslash. Like "\q" to quit psql, "\?" to get help,
and "\l" (that's lowercase ELL, not the number ONE) to get a list of databases.
- SQL queries, which do not start with a backslash. SQL query keywords are generally
typed in all-caps, though it's not required. (Do it anyway, since it's a nearly universal
convention that everybody expects you to follow.) SQL queries may be typed on multiple lines,
but they must end with a semi-colon (;).
Try \?, \l, and \q, and see what they give you. (Of course, after \q, you'll have to
launch psql again.
3. Tables, data, and simple SQL queries
Our books/authors database will consist of three tables. The first table will
be called authors. Each row of the authors table will represent a single author,
much in the same way each instance of a Java class called Author would represent a single author.
Let's create the authors table. Type or paste the following CREATE TABLE statement at the
psql prompt, and hit return. Note that you can type this one line at a time, and the statement
will only get executed once you have typed the semi-colon followed by a return.
CREATE TABLE authors (
id SERIAL,
surname text,
given_name text,
birth_year integer,
death_year integer
);
Notice a couple things about this CREATE TABLE statement.
Type "\d" at the psql prompt. You should see information about both the authors table and the
corresponding id sequence.
Type "\d authors" to see the structure of the authors table.
Type the SQL query:
SELECT * FROM authors;
to see that you don't have any authors in the table yet.
- Pick a favorite book written by somebody other than the authors in my data. I'll use
"Pattern Recognition" (2003) by William Gibson as my example.
Let's add Gibson to our authors table:
INSERT INTO authors
(surname, given_name, birth_year, death_year)
VALUES ('Gibson', 'William', 1948, NULL);
Now try the SELECT query from above again. Do you see your author in the table?
Here's what I see:
id | surname | given_name | birth_year | death_year
----+---------+------------+------------+------------
1 | Gibson | William | 1948 |
(1 row)
- Go through the last few steps again to create the books table and
a book or two by your author:
CREATE TABLE books (
id SERIAL,
title text,
publication_year integer
);
followed by some suitable INSERT statements, one for each book.
- At this point, we have an authors table, a books table, one author, and one book.
But we have not added any information to the database to indicate that our author
wrote our book. Here's how we do that:
CREATE TABLE books_authors (
book_id integer,
author_id integer
);
and
INSERT INTO books_authors (book_id, author_id) VALUES (1, 1);
Note that this is a pretty trivial case. We have one author (William Gibson, id=1) and
one book (Pattern Recognition, id=1), so adding (1, 1) to the books_authors table just
says that author #1 wrote book #1. As we add more authors and more of their books,
of course, the (book_id, author_id) pairs will be more interesting. We might see, for
example, rows in the books_authors table like (3, 5) and (17, 5), which would say that
author #5 wrote both book #3 and book #17. Or (3, 5) and (3, 12), which would say that
book #3 was co-written by author #5 and author #12.
Got it? An authors table giving information about each author, a books table giving information
about each book, and a books_authors table showing the authorship relationship between
books and authors.
Now let's BURN IT ALL DOWN and start over:
DROP TABLE authors;
DROP TABLE books;
DROP TABLE books_authors;
4. Loading and querying our books and authors
- Pull the latest changes to
my GitHub repository,
and copy the files
books/books2.csv,
books/authors2.csv, and
books/books_authors2.csv,
into your working directory.
- Go back and redo the three CREATE TABLE statements from earlier in the lab to create three empty tables.
- Type:
\copy authors FROM 'authors2.csv' DELIMITER ',' CSV NULL AS 'NULL'
- Try this SQL query:
SELECT * FROM authors;
Do you see the list of authors? (I hope so!)
- Repeat the \copy and SELECT steps for books2.csv and books_authors2.csv.
Once you're confident that all three tables have the right data in them, we can
do something cool. We can combine these tables to search the data in lots of
interesting ways. SQL has enormous power and a ton of features, and we'll barely
scratch the surface in this class. But here are a few example queries that you might find interesting.
Try to guess what they do before pasting the into psql. Then see if you were right.
SELECT authors.given_name, authors.surname, books.title, books.publication_year
FROM books, authors, books_authors
WHERE books.id = books_authors.book_id
AND authors.id = books_authors.author_id;
SELECT books.title, books.publication_year
FROM books, authors, books_authors
WHERE books.id = books_authors.book_id
AND authors.id = books_authors.author_id
AND authors.surname = 'Willis';
SELECT authors.given_name, authors.surname, books.title, books.publication_year
FROM books, authors, books_authors
WHERE books.id = books_authors.book_id
AND authors.id = books_authors.author_id
AND authors.surname LIKE 'B%'
ORDER BY authors.surname, authors.given_name;
SELECT authors.given_name, authors.surname, books.title, books.publication_year
FROM books, authors, books_authors
WHERE books.id = books_authors.book_id
AND authors.id = books_authors.author_id
ORDER BY authors.surname;
5. All done!
Now you're ready to do some cool stuff with a database. We're gonna have some fun!