Query Optimization, Individual Part

This assignment should be done individually. You may work with other people, get ideas, help each other debug, and so on; but the content that you submit should be your own.

Setup

This part of the assignment uses the same college enrollment data that you used for the last SQL assignment. If you do not still have the data set up from the last assignment, visit the previous SQL assignment web page to find instructions on connecting to lime and loading the data.

Learning about EXPLAIN

Read the documentation about EXPLAIN in the PostgreSQL manual. Look at the examples at the end to see specifically how to use it.

Questions

I've provided below some queries that were submitted by students in the past for questions from the previous SQL assignment. For some of the questions from the previous assignment, I'm providing two or three alternative choices as solutions that yield the same answer when you run them. For each group of queries, answer the following questions:

  1. Are the alternative queries all correct and/or equivalent? Are they guaranteed to give the same results (apart from order)? If they are not equivalent, why are they different? If the answer is that it depends on something, what does it depend on? [You should take it as given that the database is correct as specified in the previous assignment, including the setup of keys for each table.]
  2. Use EXPLAIN to find the query plan for each query. Transcribe each plan into a tree-like evaluation plan such as the one your textbook shows on page 585. Explain in words how the plans differ. This shouldn't be a long-winded explanation. Keep it short, but highlight what the key differences are between the plans. Explain which plan is likely to be better if the dataset were dramatically bigger; your answer here should be based on doing an estimate of how many pages will be read/written during the course of query evaluation. If that depends on what "bigger" means, explain what you're thinking.
  3. Also explain (or speculate) why the plans are different for these queries. If the queries technically do something different, that could be one explanation. But if the queries are guaranteed to provide the same result, why might the optimizer come up with different plans?

The queries

The numbers on these queries are those from the previous assignment. Feel free to go back and look to remind yourself what each query was supposed to do.

Query 3

-- Version A --
SELECT E.cno, E.sectno, AVG(S.gpa)
FROM enroll E, student S
WHERE S.sid = E.sid AND E.dname = 'Computer Sciences'
GROUP BY E.cno, E.sectno;
-- Version B --
SELECT A.cno, A.sectno, AVG(A.gpa)
FROM   (SELECT E.sid, S.gpa, E.cno, E.sectno
        FROM enroll E, student S
        WHERE S.sid = E.sid AND E.dname = 'Computer Sciences') A
GROUP BY A.cno, A.sectno;
-- Version C --
SELECT section.cno AS Course, section.sectno AS Section, AVG(student.gpa) AS avgGPA
FROM section, enroll, student
WHERE enroll.cno=section.cno AND enroll.sid = student.sid AND 
      enroll.sectno = section.sectno AND section.dname='Computer Sciences'
GROUP BY section.cno, section.sectno;

Query 4

-- Version A --
SELECT temp.sectno, temp.cname
FROM   (SELECT E.cno, E.sectno, C.cname
        FROM enroll E, course C
        WHERE E.cno = C.cno AND E.dname = C.dname) temp
GROUP BY temp.cno, temp.sectno, temp.cname
HAVING COUNT(*) > 6;
-- Version B --
SELECT course.cname, thistable.sectno
FROM course, (SELECT enroll.cno, enroll.sectno, enroll.dname, COUNT (*) AS count
              FROM enroll
              GROUP BY enroll.cno, enroll.sectno, enroll.dname) AS thistable
WHERE course.cno=thistable.cno AND course.dname=thistable.dname
      AND thistable.count > 6;

Query 5

-- Version A --
SELECT S.sid,S.sname
FROM   student S, enroll E
WHERE  S.sid=E.sid
GROUP BY S.sid, S.sname
HAVING COUNT(*) >= ALL (SELECT COUNT(*)
                        FROM student S, enroll E
                        WHERE S.sid=E.sid
                        GROUP BY s.sid);
-- Version B --
SELECT  S.sid, S.sname
FROM    student S
WHERE   S.sid IN (SELECT   E.sid
                  FROM     enroll E
                  GROUP BY E.sid
                  HAVING   COUNT(*) = (SELECT  MAX(count)
                                       FROM    (SELECT   COUNT(*)
                                                FROM     enroll E
                                                GROUP BY E.sid) AS foo));
-- Version C --
SELECT temp.sid, S.sname
FROM (SELECT COUNT(*), E2.sid
      FROM enroll E2
      GROUP BY E2.sid) AS temp, student S
WHERE temp.count = (SELECT MAX(temp.count)
                    FROM (SELECT COUNT(*), E2.sid
                          FROM enroll E2
                          GROUP BY E2.sid) AS temp)
       AND temp.sid = S.sid;

Query 7

-- Version A --
SELECT  S.sname, M.dname
FROM    student S, major M
WHERE   S.sid = M.sid AND
        S.sid IN (SELECT  S.sid
                  FROM    student S, enroll E, course C
                  WHERE   S.sid = E.sid AND E.cno = C.cno AND E.dname = C.dname
                          AND SUBSTRING(C.cname from 1 for 16) = 'College Geometry');
-- Version B --
SELECT S.sname, M.dname
FROM   enroll E, course C, major M, student S
WHERE  C.cname LIKE 'College Geometry%' AND E.cno = C.cno
       AND M.sid = E.sid AND E.sid = S.sid;
-- Version C --
SELECT S.sname, M.dname
FROM   student S, major M
WHERE  S.sid = M.sid AND
       S.sid IN (SELECT E.sid
                 FROM enroll E
                 WHERE E.cno IN (SELECT C.cno
                                 FROM course C
                                 WHERE cname LIKE 'College Geometry%' ));

What to Turn In

Submit on paper the answers to the questions above for each set of queries.