CS 257: Software Design

The story of a simple database

This is what I did to create a small PostgreSQL database out of some population data.

Here are a few SQL examples and the PostgreSQL documentation.

  1. Found the data in tabular form on the web at http://www.demographia.com/db-uscity1790.htm

  2. Copied and pasted the data into an Excel spreadsheet. The tabs in the data put the numbers into the right columns (lucky me!), so I just rearranged the two-part table into a single table by cutting and pasting the bottom table to the right of the top table. Compared the city names to make sure everything was lining up properly.

    I also had two identical city name columns at this point, so I just deleted the second one.

    (What if copy/paste hadn't worked? I probably could have written some sort of Python script to parse the data into columns based on how it was organized in the HTML. That's called "scraping" the HTML page, and is idiosyncratic, often unreliable, and usually a hassle.)

  3. Saved the spreadsheet as a comma-separated-values (CSV) file called populations.csv.

  4. Opened the CSV file with vi to take a closer look. Noticed some quirks:

    • Excel saved the lines with carriage returns instead of linefeeds as newline characters. This tends to play havoc with the Python scripts I write (since Python by default doesn't view carriage return (a.k.a. CR, a.k.a. \r) as an end-of-line character. I see this often enough that I wrote a dumb little shell script called "cr2lf" to fix it. I run this script by typing "cr2lf populations.csv". Here's the script:
      echo 'tr "\r" "\n" < $1 > $1.tmp' tr "\r" "\n" < $1 > $1.tmp echo mv $1.tmp $1 mv $1.tmp $1
    • Brooklyn has some weirdness in the table, since its data got merged into New York's data at some point. I fix this by deleting Brooklyn. I could have added the old Brooklyn data to the New York data, but it was easier for this example to just delete it.
  5. Designed a simple-minded SQL table, and wrote the CREATE statement to build the table:

    DROP TABLE IF EXISTS populations; CREATE TABLE populations ( city text, year int, population int );
  6. Getting the CSV file into PostgreSQL: a bone-headed way.

    This approach is to turn my data into a bunch of SQL INSERT statements and then execute those statements.

    • Wrote a simple Python script to convert the CSV file into a sequence of SQL INSERT statements.

    • Combined the DROP/CREATE statements with the INSERT statements into a single file called populations.sql.

    • Logged into thacker.mathcs.carleton.edu and used the psql command and the login information that Mike Tie sent me to load my table into my PostgreSQL database, like this:

      psql -f populations.sql

      or

      psql --username myname --dbname myname --host thacker.mathcs.carleton.edu -f populations.sql

      Note that the installation of psql on thacker assumes my user name is the same as my login name, and prompts me for my password after I type the command. It also dumps status information back to my screen (in this case, I saw a lot of "INSERT 0 1" lines appear after I typed my password).

    • Finally, I ran the psql command prompt to try a quick test:

      psql

      and then at the psql prompt:

      SELECT * FROM populations WHERE year=1900 ORDER BY population DESC;
  7. Getting the CSV file into PostgreSQL: a smarter way.

    Really, there must be a better, more general way than that. Yes, there is.

    So, let's convert our data into a CSV file that matches the structure of the database's table, and then use PostgreSQL's built-in CSV loader.

    • Starting with the CSV file populations.csv that Excel saved and I tweaked as described above, I wrote yet another Python script to convert the population data, which I executed like so:

      python populations2tablerows.py < populations.csv > poprows.csv
    • Logged into thacker

    • Ran "psql -f createtable.sql" where createtable.sql contains the "CREATE TABLE" SQL command. psql asked me for my PostgreSQL password, as usual.

    • Ran "psql" to get the "jondich=>" prompt.

    • Just checking, is my table there? At the psql prompt, type "\dt" (no quotation marks, no semi-colon).

    • At the psql prompt, enter:

      \copy populations FROM '/path/to/poprows.csv' DELIMITER ',' CSV

      Note: psql's backslash directives (e.g. "\copy") do not require or allow a semi-colon at the end. SQL queries (e.g. "SELECT * FROM populations") do require a semi-colon.

    • Did it work?

      SELECT * FROM populations WHERE year=1900 ORDER BY population DESC;

    Want a normalized database without the duplication of city names that we see here? Try this script, which generates cities.csv and pops.csv.