CS 257: Software Design

The story of a simple database

This is what I did to create a small PostgreSQL database out of some population data.

  1. Found the data in tabular form on the web at http://www.demographia.com/db-uscity1790.htm

  2. Copied and pasted the data into an Excel spreadsheet. The tabs in the data put the numbers into the right columns (lucky me!), so I just rearranged the two-part table into a single table by cutting and pasting the bottom table to the right of the top table. Compared the city names to make sure everything was lining up properly.

    I also had two identical city name columns at this point, so I just deleted the second one.

  3. Saved the spreadsheet as a comma-separated-values (CSV) file called populations.csv.

  4. Opened the CSV file with vi to take a closer look. Noticed some quirks:

    • Excel saved the lines with carriage returns instead of linefeeds as newline characters. This tends to play havoc with the Python scripts I write (since Python by default doesn't view carriage return (a.k.a. CR, a.k.a. \r) as an end-of-line character. I see this often enough that I wrote a dumb little shell script called "cr2lf" to fix it. I run this script by typing "cr2lf populations.csv". Here's the script:
      echo 'tr "\r" "\n" < $1 > $1.tmp'
      tr "\r" "\n" < $1 > $1.tmp
      echo mv $1.tmp $1
      mv $1.tmp $1
      
    • Brooklyn has some weirdness in the table, since its data got merged into New York's data at some point. I fix this by deleting Brooklyn. I could have added the old Brooklyn data to the New York data, but it was easier for this example to just delete it.
  5. Designed a simple-minded SQL table, and wrote the CREATE statement to build the table:

    DROP TABLE IF EXISTS populations;
    CREATE TABLE populations (
      city text,
      year int,
      population int
    );
    
  6. Getting the CSV file into PostgreSQL: a bone-headed way.

    This approach is to turn my data into a bunch of SQL INSERT statements and then execute those statements.

    • Wrote a simple Python script to convert the CSV file into a sequence of SQL INSERT statements.

    • Combined the DROP/CREATE statements with the INSERT statements into a single file called populations.sql.

    • Logged into thacker.mathcs.carleton.edu and used the psql command and the login information that Mike Tie sent me to load my table into my PostgreSQL database, like this:

      psql -f populations.sql

      Note that the installation of psql on thacker assumes my user name is the same as my login name, and prompts me for my password after I type the command. It also dumps status information back to my screen (in this case, I saw a lot of "INSERT 0 1" lines appear after I typed my password).

    • Finally, I ran the psql command prompt to try a quick test:

      psql

      and then at the psql prompt:

      SELECT * FROM populations WHERE year=1900 ORDER BY population DESC;
  7. Getting the CSV file into PostgreSQL: a smarter way.

    Really, there must be a better, more general way than that. Yes, there is.

    So, let's convert our data into a CSV file that matches the structure of the database's table, and then use PostgreSQL's built-in CSV loader.

    • Starting with the CSV file populations.csv that Excel saved and I tweaked as described above, I wrote yet another Python script to convert the population data, which I executed like so:

      python populations2tablerows.py < populations.csv > poprows.csv
    • Logged into thacker

    • Ran "psql -f createtable.sql" where createtable.sql contains the "CREATE TABLE" SQL command. psql asked me for my PostgreSQL password, as usual.

    • Ran "psql" to get the "jondich=>" prompt.

    • Just checking, is my table there? At the psql prompt, type "\dt" (no quotation marks, no semi-colon).

    • At the psql prompt, enter:

      \copy populations FROM '/path/to/poprows.csv' DELIMITER ',' CSV

      Note: psql's backslash directives (e.g. "\copy") do not require or allow a semi-colon at the end. SQL queries (e.g. "SELECT * FROM populations") do require a semi-colon.

    • Did it work?

      SELECT * FROM populations WHERE year=1900 ORDER BY population DESC;