You will work alone on this assignment. Feel free to discuss it with classmates, but
submit your own work.
In the big second-half-of-term project, you're going to choose and download some data, design
a set of database tables to represent it, and use the downloaded data to populate your database.
This process of design and data conversion takes some practice. This current assignment is intended
to give you that practice.
Goals
- Learn about database design, including a little bit about normal forms and
a few ideas about how best to organize relational database tables.
- Design a database.
- Practice writing conversion code to transfer raw data from some external
source (in this case, Kaggle) to
your database.
- Read designs created by other people for the same raw data, and discuss
the relevant design principles and tradeoffs (we'll do this in Discussion Group
after the current assignment is due).
Rubric
1 - CREATE TABLE statements are in olympics-schema.sql
3 - Quality of database design (based on principles in readings and videos)
1 - author names are in a comment at the top of convert.py
3 - convert.py converts the raw CSV files (athlete_events.csv and noc_regions.csv)
into CSV files matching the tables in olympics-schema.sql
3 - the output files from convert.py load successfully into the tables
specified in olympics-schema.sql
4 - the SQL queries in queries.sql run correctly against the resulting
populated database (1 point apiece)
Your tasks
- Read the database readings and watch the database videos included on the course page.
- Create a directory named "olympics" in your repository. All of the
files mentioned below should go in this directory.
- Using psql, create a database named "olympics".
- Download this Olympics database. You'll need to sign up for a Kaggle account, or sign in via your
Facebook, Google, or Yahoo account. If you would prefer not to use any of these
options, let me know, and I will get you a copy of the data.
- Study the structure of the two CSV files.
- Design a database to represent the data from the Kaggle CSV files.
Write your design in the form of CREATE TABLE statements, and save those
statements (and only the CREATE TABLE statements, including semi-colons)
in a text file called olympics-schema.sql.
- Write a Python program called convert.py that will read the
Kaggle CSV files and write one CSV file for each of the tables
in your database design. If you have designed a table named "athletes", for example,
the corresponding CSV file should be named "athletes.csv", and should have
the same columns as the table.
- Add each of your new CSV files to your repository. (You may, if you wish,
also add the original Kaggle files to your repository. But it's not necessary.
That athlete_events.csv file is kind of a beast at 30MB, and you may prefer
to just link to the originals at the top of convert.py or something like that.)
- Write an SQL query for each of the following, based on your database design.
Put your SQL queries in a text file named queries.sql.
- List all the NOCs (National Olympic Committees), in alphabetical order
by abbreviation. These entities, by the
way, are mostly equivalent to countries. But in some cases, you might find that
a portion of a country participated in a particular games (e.g. one guy
from Newfoundland in 1904) or some other oddball situation.
- List the names of all the athletes from Kenya. If your database design
allows it, sort the athletes by last name.
- List all the medals won by Greg Louganis, sorted by year. Include
whatever fields in this output that you think appropriate.
- List all the NOCs and the number of gold medals they have won, in decreasing
order of the number of gold medals.
Some hints and suggestions
- You can test whether your conversion is correct
for your design like this:
- Launch psql for the olympics database
- Execute your CREATE TABLE statements (copy/paste from olympics-schema.sql will work)
- Use the \copy command to load each of your new CSV files into their corresponding
tables (see
the postgres lab for a reminder of how that goes).
- Do a quick "SELECT * FROM tablename LIMIT 5;" or something along those
lines for each table to make sure the data you expect has been loaded
into the table.
- There's an SQL keyword "DISTINCT" that you'll definitely want to use for
some of your SQL statements.
- You may want to investigate the SQL "GROUP BY" construct.
- A detailed discussion of database design principles
and database normalization is too big for this course. That's why I
created
this video about database design.
That will give you more information on what I am looking for.
Have fun!