On paper, submit your answers to the following.
- Think back to the plain old simple binary search trees (BSTs) as
you saw then in your Data Structures course: these were just binary
search trees where each node contained a key value, and two
pointers. Answer each of the following questions with a sentence or
two of explanation.
- Are BSTs an ordered index (as your DB textbook defines an
ordered index)?
- Are BSTs, as you saw them, sparse or
dense?
- In a DB context, one also makes the
distinction between primary and secondary indices. Explain why
this distinction doesn't fit well into BSTs as they are
typically presented in a Data Structures class.
- Index-sequential files, also known as ISAM, are
wonderful for doing lookups on large databases that don't change
much. Suppose that we build an ISAM on a sorted file where each page
in the index has room for 20 search keys and pointers. Suppose that
the sorted file itself is sequential on disk, that every page in
that file is completely filled, and that records within this file
have a fixed width.
- If the sorted file itself has 30,000 pages in
it, how many levels will the ISAM have (not including the sorted
file itself)? Give me a number, but also give me the short
mathematical formula that enables you to determine that
answer.
- Based on your answer above, how many disk
accesses are needed (i.e., how many pages must one retrieve) to
obtain a record with a specific search key? (Note, to follow an
FAQ I'm getting: technically, one doesn't know for sure how many
disk access are necessary, because a page that you're looking for
may coincidentally be in the buffer pool. This whole issue can be
simplified by focusing instead on the question "How many pages
must be retrieved?" For purposes of this task, simply answer that
question instead, or alternatively assume that any pages you try
to access are not already in the buffer pool. (Both of these
interpretations amount to the same thing.)
- As a continuation of the previous question,
suppose that I proposed ISAM as a new topic to be discussed in a
Data Structures course. In Data Structures, it is always assumed
that all data fits in memory. So instead of a sorted file of
records, let's instead assume that I have a sorted array of
records. I then create an ISAM on that arary where each node in
the ISAM has 20 search keys and pointers.
- Since everything is now in memory, there's no
need to discuss pages or blocks. (I carefully chose the word
"node" above instead of "page.") Suppose that the lowest level of
my in-memory ISAM contains 30,000 nodes, to be consistent with the
previous example. How many nodes in the ISAM must be accessed in
order to reach an actual record in the array? Give me a number,
but also give me the short mathematical formula that enables you
to determine that answer.
- In general, if I have an in-memory ISAM with n
nodes at the bottom level and k records per page, what would be my
big-O estimate of the amount of work to find a particular record
in the array based on its search key?
- The previous question hopefully pointed out a
gaping hole in the ISAM approach we've discussed: it never came up
at all what algorithm we use to find a particular search
key within an ISAM node. Why? The technique you choose to
use is critical to your answer to the previous question.
- Here's a related followup question: in this
course, we seem to be emphasizing trees with wide fan-out (many
children). In Data Structures, one typically emphasizes trees with
only 2 children (though there is a balanced tree algorithm with 3
children.) Why the dramatic distinction? If so many children work
well in databases, why not do it for in-memory algorithms
also?