Lab: convert a CSV dataset to postgresql
This lab will walk you through the process of converting unnormalized CSV data into a better structure for querying via SQL.
The plan
- Design the database tables you want to end up with, and write
your design as a collection of
CREATE TABLESQL statements. - Use
psqlto create the tables. - Write some python to convert your CSV data into a set of new CSV files, one per table.
- Use the
\copypostgres directive to load each CSV file into its corresponding table.
Setup
I'm assuming here that you will be working on stearns. But if you succeed at installing and running postgres on your own computer, you can do this work locally instead.
- Login to stearns.mathcs.carleton.edu.
- Pull the latest changes into your clone of my cs257-2025-spring git repository.
- Take a quick look at the file
books/books2.csvfrom my repository. Note how it stores the information about an author and how it deals with books that have multiple authors.
Design the tables
I'm going to give you a database design for the purposes of this lab. Soon, you will be doing your own database design based on your project's dataset.
The goals of the following database design are:
- Eliminate textual data duplication. (Note, for example, how the books2.csv file contains multiple repetitions of the names "P.G. Wodehouse" and "Connie Willis" because there are multiple books written by both of those authors.)
- Make it easy to query the database about either books or authors or both.
- Support books that have more than one author.
- Support authors who have written more than one book.
Here are the tables we're going to use:
The books_authors table is an example of something often known as
a linking table. The idea is that each row of books_authors
concisely expresses the idea that "author X is one of the authors of book Y".
Create the tables in postgres
Launch
psql -U YOUR_USER_NAME YOUR_USER_NAME.Note that the second user name in that command is the name of your database. Mike Tie has set things up for you on stearns so you control your own database, which is named after you. When I do this work on my own computer, I typically create a database named
booksor something like that.- Copy the three
CREATE TABLEstatements above and paste them into the terminal at the psql prompt. Hit Enter to make sure the three statements get executed. - Check to make sure the tables exist now by executing
\dtat the psql prompt. The "dt" here stands for "display tables".
Create separate csv files for each table
Next, we want to convert the books2.csv file into three separate CSV files, one per table in our database design.
- Look at the table designs above and the contents of
books2.csvand compare. What data that goes in the tables is missing frombooks2.csv? - Open up my
books/csv2tables.pyprogram and read through it. What is it doing? - Run it to convert
books2.csvintoauthors.csv,books.csv, andbooks_authors.csv. - Take a look at the resulting files. Do they look right?
Load the CSV files into their corresponding tables
At the psql prompt, do this:
and similarly for the other two tables.
Check to make sure it worked
Try these from the psql prompt.
Do you see what you expect to see?
Run some interesting queries
Try this at the psql prompt. Try to make sense of what it's asking.
Now try to come up with a query that reports all the authors of the book whose title is "Character Limit".
All done!
Feel free to play around with any of this to figure it out better.