Query Optimization, Team Part

Note to self: Next time, keep just the three relevant tables; for example, get rid of word_to_url_summary. The extra tables confused people. On one-to-many vs. many-to-many, lots of people missed it or assumed it was implicit in the picture. Specify next time how you want it indicated.

This assignment must be done in pairs. It involves the use of a very large database, and we've only been granted permission to do it on one particular machine in the department. You are welcome to do this on your own computer, if you are inclined to learn how to install PostgreSQL, etc., though it is undoubtedly easier just to use our department machine. If you do choose to do this from home, you'll have to figure out how to adapt the directions (grabbing files, etc.). You're on your own if you do this: I'm only guaranteeing support in the lab.

Setup

Find machine cmc307-3 in CMC 307, and boot it to Linux. Log in. Change your current directory to the course directory; e.g., at a terminal window type

cd /Accounts/courses/cs334/data
Start up PostgreSQL. The user name is "cs334", and the password is "dave@x4369". So to start up PostgreSQL, type
psql -U cs334
and enter the password at the prompt accordingly. Once you've managed to get into PostgreSQL, drop all tables and indexes that previous students might have left behind. You can see all tables with the "\dt" command, and all indexes with the "\di" command. To drop a table or an index, here are examples:
DROP TABLE link;
DROP INDEX index1;
Once you have dropped all the tables and indexes, restore the data for this assignment from backup by typing
\i pgcrawler.dump
This should take about 6.5 minutes, so get familiar with the rest of the assignment while it runs. You'll see some "CREATE TABLE", "ALTER TABLE", and "NOTICE" messages dumped to the screen. That's fine. If you haven't correctly dropped all your tables before invoking this script, you might have a problem.

Getting to it

The data you see was obtained by a Carleton comps group a few years ago that built a search engine (à la Google). It comes from a crawl of Carleton's intranet at the time. The data may be incomplete, and it is most certainly out of date, but it represents a neat snapshot of Carleton's web presence at the time.

First, get familiar with the data. We will not use all of the tables that are here; rather, we will focus on the tables word, word_to_url, and url. Look at the columns that each has (the PostgreSQL command "\d" is good here). You might also want to print out the first 10 records of each just to see them; you can do so by typing:

select * from whatever limit 10;
The following instructions have specific tasks mixed in. You should turn in on paper whatever the tasks ask for.

Task 1: Draw an E-R diagram showing the entities and relationships that these three tables model. It should be a fairly straightforward picture. Indicate one-to-many, many-to-many, or one-to-one relationships as they exist.

The key purpose of this data was to be a search engine. In other words, for a particular word that someone was searching for, the goal was to find all URLs (websites) that contained that word.
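As a rough sketch, such a lookup typically joins the three tables through the bridge table. The column names below are assumptions, not the actual schema; check the real ones with "\d" before writing your own query:

```sql
-- General pattern for finding pages containing a word, via a bridge table.
-- Column names (id, word, word_id, url_id) are guesses; verify with \d.
SELECT url.url
FROM word
JOIN word_to_url ON word_to_url.word_id = word.id
JOIN url ON url.id = word_to_url.url_id
WHERE word.word = 'computer';
```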

Write an SQL query to print out all web pages that contain the word "computer". (In reality, this query would be constructed underneath a web page, as you did for the previous assignment.) Run the query and observe how long it takes.

Task 2: Include here your SQL query, and your estimate of how long the query took. How does this compare to what you would expect from a good search engine?

Use the EXPLAIN command (see the individual part of this assignment for more details) to obtain the query evaluation plan.
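For example, prefixing a query with EXPLAIN shows the planner's chosen plan; EXPLAIN ANALYZE additionally runs the query and reports actual times and row counts:

```sql
-- Show the query evaluation plan without executing the query:
EXPLAIN SELECT * FROM url LIMIT 10;

-- Execute the query as well, reporting actual timings:
EXPLAIN ANALYZE SELECT * FROM url LIMIT 10;
```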

Task 3: Transcribe the above plan into a relational algebra query evaluation tree, similar to those your textbook uses. If the dataset were much bigger (i.e., on the scale of the Internet), how would you expect your query time to change? Answer quantitatively using your best guesstimate of the size of the Internet relative to this dataset.

You should be able to do something that will make this query and all others like it (e.g., for all other words one might look up) produce output almost instantly. Do so.
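If you need a reminder of the relevant PostgreSQL DDL syntax, its general form is below; the index, table, and column names are placeholders, and choosing where to apply it is the point of the task:

```sql
-- General CREATE INDEX syntax; myindex, mytable, and mycolumn are placeholders:
CREATE INDEX myindex ON mytable (mycolumn);
```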

Task 4: What did you do to make your query run so much more quickly? Indicate the precise syntax that you typed into PostgreSQL to make it happen. How fast would you estimate that your query would now run, on the scale of the entire Internet?

When done, drop all of your tables and indexes. Your fellow students will thank you.

What to Turn In

Submit on paper whatever the tasks specify. Have fun!