CS 334: Optimization of slow queries

Overview

This is a pair assignment.

This assignment involves the use of a very large database so that you can make some reasonable timing measurements at a scale where they matter. We've only been granted permission to do this on nine machines in the department: CMC307-01 through CMC307-10 (except, mysteriously, CMC307-06). You are welcome to do this on your own computer, if you are inclined to learn how to install PostgreSQL and so on, though it is undoubtedly easier just to use our department machines. If you do choose to use your own computer, you'll have to figure out how to adapt the directions (grabbing files, etc.). You're on your own if you do this: I'm only guaranteeing support in the lab.

As soon as possible, coordinate with your teammate and have one of you sign up for a time slot.

Setup

Sit down in front of one of the lab computers mentioned above and log in. Change your current directory to the following directory; that is, at a terminal window, type

cd /Accounts/courses/cs334/data

Start up PostgreSQL with the following command:

/Library/PostgreSQL/11/bin/psql -U cs334 cs334

Mike Tie has asked me not to put the password in the clear on the web, to save us from attackers, so I have posted the password on Moodle.

Once you've managed to get into PostgreSQL, drop all tables and indexes that previous students might have left behind. You can see all tables with the "\dt" command, and all indexes with the "\di" command. To drop a table or an index, here are examples:

DROP TABLE mytable;
DROP INDEX index1;
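
If PostgreSQL refuses a plain DROP because other objects depend on the table (a foreign-key constraint or a view, say), one standard option is to add CASCADE, which also removes those dependent objects:

DROP TABLE mytable CASCADE;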

It will be useful to know how long each of your queries takes. Issue the following command at the Postgres prompt:

\timing

This will show, at the end of every query you issue, how long it took.
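
With timing on, psql prints a line like the following after each query (the number here is made up):

Time: 1234.567 ms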

Now you're ready to import the data you need. To do so, restore the data for this assignment from backup by typing

\i crawler2019.dump

This may take 5 minutes or so, so be patient. You'll see some "CREATE TABLE", "ALTER TABLE", and "NOTICE" messages dumped to the screen. You'll also see an error regarding a plugin you don't have access to, but it doesn't matter. However, if you haven't correctly dropped all your tables before invoking this script, you might have a problem.

Warning from Mike Tie: do NOT copy this 2.4G dump file into your home directory. You'll blow out our server disk space.

Getting to it

The data you see was obtained by a Carleton comps group a few years ago that built a search engine (à la Google); it is what they obtained from a crawl of Carleton's intranet at the time. The data is incomplete and out of date, and I've enhanced it with some additional random data to make the dataset bigger.

First, get familiar with the data. Look at the columns that each table has (the PostgreSQL command "\d" is good here). You might also want to print out the first 10 records of each just to see them; you can do so by typing:

select * from whatever limit 10;
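
You can also give "\d" a table name to see just that table's columns. For example (whatever is a placeholder; substitute a real table name):

\d whatever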

The key purpose of this data was to power a search engine. In other words, for a particular word that someone was searching for, the goal was to find all URLs (web pages) that contained that word.
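
To give the flavor of such a query (these table and column names are hypothetical, not necessarily what this schema looks like; check the real schema with "\d"): if the crawl were stored as words(word, urlid) and urls(urlid, url), a lookup would be a join along these lines:

select u.url
from urls u
join words w on w.urlid = u.urlid
where w.word = 'computer';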

The following instructions have specific tasks mixed in. You should submit whatever the tasks ask for.

Task 1: Observe the problem.

Write an SQL query to print out all web pages that contain the word "computer". (In reality, this query would be constructed behind the scenes by a web page.) Run the query and observe how long it takes. Run the query again at least three more times, using other words instead of "computer". What was the SQL query you constructed? Submit it for one word in particular. Also estimate approximately how long the query takes for each word. How does this compare with what you would expect from a good search engine?


Editorial notice: In various comps projects in the past, it is at about this point that students have come to their comps adviser (sometimes me, sometimes others) holding pitchforks and torches, bellowing that the hardware they have been given is not powerful enough to accomplish their task in a reasonable period of time.


Task 2: Diagnose the problem.

Obtain the query evaluation plan by putting the word EXPLAIN in front of your query. Explain why the query is taking so long to run.
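
For instance, using the hypothetical query sketched above (substitute your actual Task 1 query):

EXPLAIN
select u.url
from urls u
join words w on w.urlid = u.urlid
where w.word = 'computer';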

Task 3: Fix the problem.

You should be able to do something that will make this query, and all others like it (e.g., for any other word one might look up), produce output almost instantly. You might need to look through the PostgreSQL documentation for the exact syntax, but there should be a very simple and straightforward thing to do to dramatically speed up this query; it should require a single SQL command. Do it. This single command may take a long time to run, but if it makes all subsequent searches for individual words much faster, it's worth it. What did you do to make your query run so much more quickly? Indicate the precise syntax that you typed into PostgreSQL to make it happen. How long does your query take to run now?

Task 4: Understand the fix.

Obtain the query evaluation plan again, now that the problem is solved. Explain what is being done differently in the evaluation plan, and why it helps.

When done, drop all of your tables and indexes. Your fellow students will thank you.

What to turn in

Submit your answers to the above tasks in a PDF. Remember this experience.