CS 334: SQL Assignment
This assignment should be done in pairs, if you are working in a pair.
Note carefully: if you are working in a pair, this must be done with
the two of you working side-by-side. Take turns typing; perhaps for each query
you might swap the keyboard back and forth, or perhaps you might set a timer and
swap every 15 minutes. Do not work on queries separately from
each other and combine at the end.
This homework will be graded anonymously via Moodle's
anonymous
marking feature, so do not include your name anywhere in your
submission. We will assume that you are working in your assigned pair, if
you have one; if you end up doing the assignment individually for some
reason, indicate so in a comment in your submission without actually
referring to your own name. (You can also contact me separately if need be.) If you are working in your regularly assigned pair, only one of you should submit the assignment.
Connecting to thacker
PostgreSQL is installed on the department machine named
thacker. Make sure to connect to thacker before doing
anything else for the assignment.
The Database
This assignment will use a fictional database containing college-type
data. The data is located in seven flat files. You can find
all the flat files in this zip file.
Your first goal is to import this data into relations
in PostgreSQL. The schema of the database that you should create is as follows
(keys are in bold):
student (sid, sname, sex, age, year, gpa)
dept (dname, numphds)
prof (pname, dname)
course (cno, cname, dname)
major (dname, sid)
section (dname, cno, sectno, pname)
enroll (sid, grade, dname, cno, sectno)
To import a flat file into PostgreSQL, here are the steps you
should follow:
- Create a new directory for this assignment.
- Copy all the data files into your new directory.
- Start up PostgreSQL by typing psql at the command
line. Enter in the password that you've been emailed.
- Create a table called enroll, which will contain the appropriate
fields. I recommend that you put all your "CREATE TABLE"
statements in a script called createtables.sql, which you can then invoke by
typing \i createtables.sql. Page 62 of your textbook shows some great
examples on how to do this.
- Type \copy enroll from enroll.data into PostgreSQL to
import the data. Don't put a semicolon at the end of the line.
- Verify that the data made it into PostgreSQL correctly.
You should import the data from the other six relations
similarly. The order that you import them in is relevant if you're
using foreign keys. Whether or not you do so is up to you.
Queries
Design SQL queries that answer the questions given below (one query
per question) and run them using PostgreSQL. Your queries should be
correct with respect to the design of this dataset. In other words,
we should be able to use your SQL queries on another dataset with
the same schema and get correct answers even if the actual data
within is different.
The query answers should be duplicate free, but you should use
distinct only when necessary in general. Determining whether
or not you need distinct should not depend on the particular
dataset you have before you, but on its design.
It is most important to write queries that are correct, but a fraction of
points will be reserved for good style. Here are some things to consider:
- There is generally little agreement on a single standard for formatting and
capitalizing your SQL code (indentation,
etc). StackOverflow
has multiple differing options. Pick one of those styles or something
similar, and be
consistent.
- You should write your queries as briefly as you can. Redundant or gratuitous
additional pieces make your queries harder to understand and maintain. In
particular, be wary of using nested queries. They have their place, and you
will use them in this assignment, but they are not a preferred style if you
can get away with something simpler.
- Don't use natural joins. Most SQL developers agree that this
convenient shortcut is an opportunity for unexpected bugs to occur.
Here are the queries you should write. Put a comment above each query with
the number and description so we can easily tell which query is which.
- Print the names of professors who work in departments
that have fewer than 50 PhD students.
- Print the names of the students with the lowest GPA.
- For each Computer Sciences class, print the class number, section number,
and the average gpa of the students enrolled in the class section.
- Print the names and section numbers of all sections with more than six students
enrolled in them.
- Print the name(s) and sid(s) of the student(s) enrolled in the most sections.
- Print the names of departments that have one or more majors who are under
18 years old.
- Print the names and majors of students who are taking one of the College
Geometry courses.
- For those departments that have no major taking a College Geometry course
print the department name and the number of PhD students in the department.
- Print the names of students who are taking both a Computer Sciences course
and a Mathematics course.
- Print the age difference between the oldest and the youngest Computer
Sciences major.
- For each department that has one or more majors with a GPA under 1.0,
print the name of the department and the average GPA of its majors.
- Print the ids, names and GPAs of the students who are currently taking
all the Civil Engineering courses.
All of your SQL queries should be contained in a single file called
queries.sql.
Multiple Parts
Part 1: Submit queries 1-2. (just the first two)
Part 2: Submit queries 1-12. (Include the queries from the first part, or fix them if you need to.)
What to Turn In
Submit via Moodle your queries.sql file.
Bonus questions (no extra credit, but more to do if you're
having fun)
- Does the dataset that I've given you allow one to "cheat" at any of
the above questions, i.e. issue a query that is wrong with respect to
the schema but still gives the correct answers for this particular
instance? If so, which questions are they, and what is it about this
sample instance that allows "wrong" queries to work?
- There's a programming framework called Django
that's quite popular; it allows you to query a database entirely via
Python objects, without embedding SQL within your code. It is
amazingly cool for developing web applications quickly, and for
writing straightforward queries. Writing complex queries in Django,
however, can be more complex than in SQL or not doable at all. (Django
lets you write a "raw" SQL query if you like, but if you're doing a
lot of this, you're losing most of the benefit of using Django in the
first place.) If you've been enjoying working with Django, or if you
want to experiment with it, install it yourself and see if you can
figure out how to rewrite the above queries without using raw
SQL.