Lab: convert a CSV dataset to postgresql

This lab will walk you through the process of converting unnormalized CSV data into a better structure for querying via SQL.

The plan

Design the database tables you want to end up with, and write your design as a collection of CREATE TABLE SQL statements.
Use psql to create the tables.
Write some python to convert your CSV data into a set of new CSV files, one per table.
Use the \copy postgres directive to load each CSV file into its corresponding table.

Setup

I'm assuming here that you will be working on stearns. But if you succeed at installing and running postgres on your own computer, you can do this work locally instead.

Login to stearns.mathcs.carleton.edu.
Pull the latest changes into your clone of my cs257-2025-spring git repository.
Take a quick look at the file books/books2.csv from my repository. Note how it stores the information about an author and how it deals with books that have multiple authors.

Design the tables

I'm going to give you a database design for the purposes of this lab. Soon, you will be doing your own database design based on your project's dataset.

The goals of the following database design are:

Eliminate textual data duplication. (Note, for example, how the books2.csv file contains multiple repetitions of the names "P.G. Wodehouse" and "Connie Willis" because there are multiple books written by both of those authors.)
Make it easy to query the database about either books or authors or both.
Support books that have more than one author.
Support authors who have written more than one book.

Here are the tables we're going to use:

CREATE TABLE authors ( id integer NOT NULL, surname text, given_name text, birth_year integer, death_year integer ); CREATE TABLE books ( id integer NOT NULL, title text, publication_year integer ); CREATE TABLE books_authors ( book_id integer, author_id integer );

The books_authors table is an example of something often known as a linking table. The idea is that each row of books_authors concisely expresses the idea that "author X is one of the authors of book Y".

Create the tables in postgres

Launch psql -U YOUR_USER_NAME YOUR_USER_NAME.

Note that the second user name in that command is the name of your database. Mike Tie has set things up for you on stearns so you control your own database, which is named after you. When I do this work on my own computer, I typically create a database named books or something like that.
Copy the three CREATE TABLE statements above and paste them into the terminal at the psql prompt. Hit Enter to make sure the three statements get executed.
Check to make sure the tables exist now by executing \dt at the psql prompt. The "dt" here stands for "display tables".

Create separate csv files for each table

Next, we want to convert the books2.csv file into three separate CSV files, one per table in our database design.

Look at the table designs above and the contents of books2.csv and compare. What data that goes in the tables is missing from books2.csv?
Open up my books/csv2tables.py program and read through it. What is it doing?
Run it to convert books2.csv into authors.csv, books.csv, and books_authors.csv.
Take a look at the resulting files. Do they look right?

Load the CSV files into their corresponding tables

At the psql prompt, do this:

\copy authors FROM 'authors.csv' DELIMITER ',' CSV NULL AS 'NULL'

and similarly for the other two tables.

Check to make sure it worked

Try these from the psql prompt.

SELECT * FROM authors; SELECT * FROM books; SELECT * FROM books_authors;

Do you see what you expect to see?

Run some interesting queries

Try this at the psql prompt. Try to make sense of what it's asking.

SELECT books.title, books.publication_year FROM books, authors, books_authors WHERE authors.surname = 'Willis' AND authors.id = books_authors.author_id AND books.id = books_authors.book_id;

SELECT authors.given_name, authors.surname, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id;

SELECT authors.given_name, authors.surname, books.title, books.publication_year FROM books, authors, books_authors WHERE books.id = books_authors.book_id AND authors.id = books_authors.author_id AND authors.surname LIKE 'B%' ORDER BY authors.surname, authors.given_name;

Now try to come up with a query that reports all the authors of the book whose title is "Character Limit".

All done!

Feel free to play around with any of this to figure it out better.