Database design
Folder: olympics
Hey Jeff, next time you need to have them do pg_dump to make grading less of a giant headache.
You will work alone on this assignment. Feel free to discuss it with classmates, but submit your own work.
In upcoming multi-week web application project, you're going to choose and download some data, design a set of database tables to represent it, and use the downloaded data to populate your database. This process of design and data conversion takes some practice. This current assignment is intended to give you that practice.
Goals
- Learn about database design, including a little bit about normal forms and a few ideas about how best to organize relational database tables.
- Design a database.
- Practice writing conversion code to transfer raw data from some external source (in this case, Kaggle) to your database.
- Read designs created by other people for the same raw data, and discuss the relevant design principles and tradeoffs (we'll do this in small groups after the current assignment is due).
Rubric
Your tasks
- Read the database readings and watch the database videos included on the course page.
- Create a directory named "olympics" in your repository. All of the files mentioned below should go in this directory.
- Using psql, create a database named "olympics".
- Download this Olympics database. You'll need to sign up for a Kaggle account, or sign in via your Facebook, Google, or Yahoo account. If you would prefer not to use any of these options, let me know, and I will get you a copy of the data.
- Study the structure of the two CSV files.
- Design a database to represent the data from the Kaggle CSV files. Write your design in the form of CREATE TABLE statements, and save those statements (and only the CREATE TABLE statements, including semi-colons) in a text file called olympics-schema.sql.
- Write a Python program called convert.py that will read the Kaggle CSV files and write one CSV file for each of the tables in your database design. If you have designed a table named "athletes", for example, the corresponding CSV file should be named "athletes.csv", and should have the same columns as the table.
- Add each of your new CSV files to your repository. Please do not add the original Kaggle files to your repository. Kaggle's athlete_events.csv file is kind of a beast at 30MB, and it's better to just put a comment at the to of convert.py telling the user of convert.py how to get the source data.
Write an SQL query for each of the following, based on your database design. Put your SQL queries in a text file named queries.sql.
- List all the NOCs (National Olympic Committees), in alphabetical order by abbreviation. These entities, by the way, are mostly equivalent to countries. But in some cases, you might find that a portion of a country participated in a particular games (e.g. one guy from Newfoundland in 1904) or some other oddball situation.
- List the names of all the athletes from Jamaica. If your database design allows it, sort the athletes by last name.
- List all the medals won by Greg Louganis, sorted by year. Include whatever fields in this output that you think appropriate.
- List all the NOCs and the number of gold medals they have won, in decreasing order of the number of gold medals.
Some hints and suggestions
You can test whether your conversion is correct for your design like this:
- Launch psql for the olympics database
- Execute your CREATE TABLE statements (copy/paste from olympics-schema.sql will work)
- Use the \copy command to load each of your new CSV files into their corresponding tables (see the postgres lab for a reminder of how that goes).
- Do a quick "SELECT * FROM tablename LIMIT 5;" or something along those lines for each table to make sure the data you expect has been loaded into the table.
- There's an SQL keyword "DISTINCT" that you'll definitely want to use for some of your SQL statements. Look it up.
- You may want to investigate the SQL "GROUP BY" construct.
- A detailed discussion of database design principles and database normalization is too big for this course. That's why I created and assigned the "miniscule intro to database design" linked on the course web page. That video will help you understand what I am looking for.