CS 257 Assignment

You will work alone on this assignment. Feel free to discuss it with classmates, but submit your own work.

In the big second-half-of-term project, you're going to choose and download some data, design a set of database tables to represent it, and use the downloaded data to populate your database. This process of design and data conversion takes some practice, and this assignment is intended to give you that practice.

Goals

Learn about database design, including a little bit about normal forms and a few ideas about how best to organize relational database tables.
Design a database.
Practice writing conversion code to transfer raw data from some external source (in this case, Kaggle) to your database.
Read designs created by other people for the same raw data, and discuss the relevant design principles and tradeoffs (we'll do this in Discussion Group after the current assignment is due).

Rubric

1 - CREATE TABLE statements are in olympics-schema.sql
3 - Quality of database design (based on principles in readings and videos)
1 - author names are in a comment at the top of convert.py
3 - convert.py converts the raw CSV files (athlete_events.csv and noc_regions.csv) into CSV files matching the tables in olympics-schema.sql
3 - the output files from convert.py load successfully into the tables specified in olympics-schema.sql
4 - the SQL queries in queries.sql run correctly against the resulting populated database (1 point apiece)

Your tasks

Read the database readings and watch the database videos included on the course page.
Create a directory named "olympics" in your repository. All of the files mentioned below should go in this directory.
Using psql, create a database named "olympics".
Download this Olympics database. You'll need to sign up for a Kaggle account, or sign in via your Facebook, Google, or Yahoo account. If you would prefer not to use any of these options, let me know, and I will get you a copy of the data.
Study the structure of the two CSV files.
Design a database to represent the data from the Kaggle CSV files. Write your design in the form of CREATE TABLE statements, and save those statements (and only the CREATE TABLE statements, including their terminating semicolons) in a text file called olympics-schema.sql.
Write a Python program called convert.py that will read the Kaggle CSV files and write one CSV file for each of the tables in your database design. If you have designed a table named "athletes", for example, the corresponding CSV file should be named "athletes.csv", and should have the same columns as the table.
Add each of your new CSV files to your repository. (You may, if you wish, also add the original Kaggle files to your repository. But it's not necessary. That athlete_events.csv file is kind of a beast at 30MB, and you may prefer to just link to the originals at the top of convert.py or something like that.)
Write an SQL query for each of the following, based on your database design. Put your SQL queries in a text file named queries.sql.

List all the NOCs (National Olympic Committees), in alphabetical order by abbreviation. These entities, by the way, are mostly equivalent to countries. But in some cases, you might find that a portion of a country participated in a particular games (e.g. one guy from Newfoundland in 1904) or some other oddball situation.
List the names of all the athletes from Kenya. If your database design allows it, sort the athletes by last name.
List all the medals won by Greg Louganis, sorted by year. Include whatever fields you think appropriate in the output.
List all the NOCs and the number of gold medals they have won, in decreasing order of the number of gold medals.
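The conversion step in convert.py usually comes down to reading flat rows, assigning an integer ID to each distinct entity, and writing one CSV file per table. Here is a minimal sketch of that idea, not a prescribed design: the `Name` and `Team` columns, the two-table split, the sample rows, and the helper names `split_rows` and `write_table` are all made up for illustration, and the toy data assumes one row per athlete, which the real Kaggle file does not.

```python
import csv
import io

def split_rows(rows):
    """Split flat rows into two hypothetical tables linked by integer IDs."""
    team_ids = {}   # team name -> assigned ID
    athletes = []   # rows for a hypothetical "athletes" table
    for row in rows:
        team = row['Team']
        if team not in team_ids:
            # First time we see this team: give it the next ID.
            team_ids[team] = len(team_ids) + 1
        # Toy assumption: every input row is a different athlete.
        athletes.append([len(athletes) + 1, row['Name'], team_ids[team]])
    teams = [[team_id, name] for name, team_id in team_ids.items()]
    return teams, athletes

def write_table(out_file, rows):
    """Write rows with csv.writer so commas and quotes in values are escaped."""
    csv.writer(out_file).writerows(rows)

# Made-up sample standing in for a raw CSV file.
raw = io.StringIO(
    'Name,Team\n'
    '"Smith, Jo",Freedonia\n'
    'Ada Lee,Sylvania\n'
    'Bo Chen,Freedonia\n'
)
teams, athletes = split_rows(csv.DictReader(raw))

out = io.StringIO()
write_table(out, athletes)
print(out.getvalue())
```

When writing to real files instead of `io.StringIO`, the csv module documentation recommends opening them with `newline=''`. Letting `csv.writer` do the quoting matters because some values contain commas or quotation marks.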

Some hints and suggestions

You can test whether your conversion is correct for your design like this:
- Launch psql for the olympics database
- Execute your CREATE TABLE statements (copy/paste from olympics-schema.sql will work)
- Use the \copy command to load each of your new CSV files into its corresponding table (see the postgres lab for a reminder of how that goes).
- Do a quick "SELECT * FROM tablename LIMIT 5;" or something along those lines for each table to make sure the data you expect has been loaded into the table.
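Before running the steps above, it can save time to check that every row in an output file has the number of columns its table expects, since \copy stops with an error on a row with too few or too many fields. The following is a rough pre-check sketch, not a replacement for loading into psql; the function name `find_bad_rows` and the sample data are hypothetical.

```python
import csv
import io

def find_bad_rows(csv_file, expected_columns):
    """Return the 1-based numbers of rows whose column count is wrong."""
    bad = []
    for row_number, row in enumerate(csv.reader(csv_file), start=1):
        if len(row) != expected_columns:
            bad.append(row_number)
    return bad

# Made-up sample: the second row is missing its third column.
sample = io.StringIO(
    '1,"Smith, Jo",1\n'
    '2,Ada Lee\n'
    '3,Bo Chen,1\n'
)
bad_rows = find_bad_rows(sample, 3)
print(bad_rows)  # prints [2]
```

Using `csv.reader` rather than splitting on commas means a quoted value such as "Smith, Jo" counts as one column.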
There's an SQL keyword "DISTINCT" that you'll definitely want to use for some of your SQL statements.
You may want to investigate the SQL "GROUP BY" construct.
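To see what DISTINCT and GROUP BY do, here is a small sketch on a toy table that has nothing to do with the Olympics data. It uses Python's built-in sqlite3 module with an in-memory database purely for convenience. That is an assumption of this example: the assignment itself uses PostgreSQL, though these two queries use only standard SQL. The `purchases` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE purchases (customer TEXT, item TEXT)')
conn.executemany('INSERT INTO purchases VALUES (?, ?)', [
    ('ana', 'pen'), ('ana', 'ink'), ('raj', 'pen'), ('ana', 'pen'),
])

# DISTINCT collapses duplicate result rows: each customer appears once.
distinct_customers = conn.execute(
    'SELECT DISTINCT customer FROM purchases ORDER BY customer'
).fetchall()

# GROUP BY makes one result row per customer; COUNT(*) counts the rows
# in each group.
counts = conn.execute(
    'SELECT customer, COUNT(*) AS n FROM purchases '
    'GROUP BY customer ORDER BY n DESC'
).fetchall()

print(distinct_customers)  # prints [('ana',), ('raj',)]
print(counts)              # prints [('ana', 3), ('raj', 1)]
conn.close()
```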
A detailed discussion of database design principles and database normalization is too big for this course, so I created this video about database design instead. It will give you more information on what I am looking for.

CS 257: Software Design

Database design

Have fun!