CS 334: SQL Assignment

This assignment should be done in pairs, if you are working in a pair.

Note carefully: if you are working in a pair, this must be done with the two of you working side-by-side. Take turns typing; perhaps for each query you might swap the keyboard back and forth, or perhaps you might set a timer and swap every 15 minutes. Do not work on queries separately from each other and combine at the end.

This homework will be graded anonymously via Moodle's anonymous marking feature, so do not include your name anywhere in your submission. We will assume that you are working in your assigned pair, if you have one; if you end up doing the assignment individually for some reason, indicate so in a comment in your submission without actually referring to your own name. (You can also contact me separately if need be.) If you are working in your regularly assigned pair, only one of you should submit the assignment.

Connecting to `thacker`

PostgreSQL is installed on the department machine named thacker. Make sure to connect to thacker before doing anything else for the assignment.

The Database

This assignment will use a fictional database containing college-type data. The data is located in seven flat files. You can find all the flat files in this zip file.

Your first goal is to import this data into relations in PostgreSQL. The schema of the database that you should create is as follows (keys are in bold):

student (sid, sname, sex, age, year, gpa)
dept (dname, numphds)
prof (pname, dname)
course (cno, cname, dname)
major (dname, sid)
section (dname, cno, sectno, pname)
enroll (sid, grade, dname, cno, sectno)

To import a flat file into PostgreSQL, here are the steps you should follow:

Create a new directory for this assignment.
Copy all the data files into your new directory.
Start up PostgreSQL by typing psql at the command line. Enter in the password that you've been emailed.
Create a table called enroll, which will contain the appropriate fields. I recommend that you put all your "CREATE TABLE" statements in a script called createtables.sql, which you can then invoke by typing \i createtables.sql. Page 62 of your textbook shows some great examples on how to do this.
Type \copy enroll from enroll.data into PostgreSQL to import the data. Don't put a semicolon at the end of the line.
Verify that the data made it into PostgreSQL correctly.

You should import the data from the other six relations similarly. The order that you import them in is relevant if you're using foreign keys. Whether or not you do so is up to you.

Queries

Design SQL queries that answer the questions given below (one query per question) and run them using PostgreSQL. Your queries should be correct with respect to the design of this dataset. In other words, we should be able to use your SQL queries on another dataset with the same schema and get correct answers even if the actual data within is different.

The query answers should be duplicate free, but you should use distinct only when necessary in general. Determining whether or not you need distinct should not depend on the particular dataset you have before you, but on its design.

It is most important to write queries that are correct, but a fraction of points will be reserved for good style. Here are some things to consider:

There is generally little agreement on a single standard for formatting and capitalizing your SQL code (indentation, etc). StackOverflow has multiple differing options. Pick one of those styles or something similar, and be consistent.
You should write your queries as briefly as you can. Redundant or gratuitous additional pieces make your queries harder to understand and maintain. In particular, be wary of using nested queries. They have their place, and you will use them in this assignment, but they are not a preferred style if you can get away with something simpler.
Don't use natural joins. Most SQL developers agree that this convenient shortcut is an opportunity for unexpected bugs to occur.

Here are the queries you should write. Put a comment above each query with the number and description so we can easily tell which query is which.

Print the names of professors who work in departments that have fewer than 50 PhD students.
Print the names of the students with the lowest GPA.
For each Computer Sciences class, print the class number, section number, and the average gpa of the students enrolled in the class section.
Print the names and section numbers of all sections with more than six students enrolled in them.
Print the name(s) and sid(s) of the student(s) enrolled in the most sections.
Print the names of departments that have one or more majors who are under 18 years old.
Print the names and majors of students who are taking one of the College Geometry courses.
For those departments that have no major taking a College Geometry course print the department name and the number of PhD students in the department.
Print the names of students who are taking both a Computer Sciences course and a Mathematics course.
Print the age difference between the oldest and the youngest Computer Sciences major.
For each department that has one or more majors with a GPA under 1.0, print the name of the department and the average GPA of its majors.
Print the ids, names and GPAs of the students who are currently taking all the Civil Engineering courses.

All of your SQL queries should be contained in a single file called queries.sql.

Multiple Parts

Part 1: Submit queries 1-2. (just the first two)

Part 2: Submit queries 1-12. (Include the queries from the first part, or fix them if you need to.)

What to Turn In

Submit via Moodle your queries.sql file.

Bonus questions (no extra credit, but more to do if you're having fun)

Does the dataset that I've given you allow one to "cheat" at any of the above questions, i.e. issue a query that is wrong with respect to the schema but still gives the correct answers for this particular instance? If so, which questions are they, and what is it about this sample instance that allows "wrong" queries to work?
There's a programming framework called Django that's quite popular; it allows you to query a database entirely via Python objects, without embedding SQL within your code. It is amazingly cool for developing web applications quickly, and for writing straightforward queries. Writing complex queries in Django, however, can be more complex than in SQL or not doable at all. (Django lets you write a "raw" SQL query if you like, but if you're doing a lot of this, you're losing most of the benefit of using Django in the first place.) If you've been enjoying working with Django, or if you want to experiment with it, install it yourself and see if you can figure out how to rewrite the above queries without using raw SQL.