Minibase Assignment #4: External Sorting

Assigned on Wednesday, 11/5.
Due on electronically on Monday, 11/17, at 4:45 PM.

Introduction

In this assignment, you will implement external sorting. You should begin by reading the chapters on Sorting, and on Disks and Files.


Getting Started

Create a directory for the assignment, and change to that directory. Copy all the files from /Accounts/courses/cs347/proj4/ to your current directory. The files include, among others:


What You Have to Implement


You have to implement the constructor of the Sort class, which will do all the sorting work:
class Sort {
public:
Sort(
char* inFile, // Name of unsorted heapfile.
char* outFile, // Name of sorted heapfile.
int len_in, // Number of fields in input records.
AttrType in[], // Array containing field types of input records.
// Field type can be attrInteger or attrString.
// Index of in[] ranges from 0 to (len_in - 1).
short str_sizes[], // Array containing field sizes of input records.
int fld_no, // The number of the field to sort on.
// fld_no ranges from 0 to (len_in - 1).
TupleOrder sort_order, // The order in which the records are sorted.
// can be Ascending or Descending
int amt_of_buf, // Number of buffer pages available for sorting.
Status& s // Status of call
);
};

Structure of Data Records

A data record contains multiple fields. The number of fields is indicated by len_in. Each field might have different length, which is indicated by the str_sizes[] array.

The type of each field can be an Integer or a String, which is indicated by the in[] array.

Suppose a heap file with the name "infile'' stores the students' records. The format of the student record is as follows:

struct StudentRecord {
        int sid;
        char name[10];
        int age;
}

We want to sort the records in ascending order of the age attribute and store the sorted records in a heap file named "outfile''. The number of available buffer frames is 10. The arguments to the constructor of the Sort class are as follows:

inFile = "infile"
outFile = "outfile"
len_in = 3
in = {attrInteger, attrString, attrInteger}
str_sizes = {4, 10, 4}
fld_no = 2
sort_order = Ascending
amt_of_buf = 10

Internal Design

You have to sort a file of fixed size records, stored in a heap file on disk, using a given amount of main memory, which will be equal to amt_of_buf * PAGESIZE. The sort process consists of two passes as follows:

Pass 1: The first phase of sorting is performed by dynamically allocating the given amount of main memory.  Read the data from the unsorted input heap file, one record at a time, into this "sorting area''.  Keep doing this until the sorting area is full.

Sort this set of records by calling the qsort library function (see the man pages on qsort if you are unfamiliar with it). Your call to qsort will take a pointer to your sorting area, the
number of records contained, the size of each record, and a pointer to your comparison function.

qsort does a generalized quick-sort, calling your comparison function to determine the order of any two records it needs to compare.  Your comparison function should take two pointers to Tuples, and return -1, 0, or 1. If type of the sorted field is attrInteger, then you can compare two fields directly. If the type of the sorted attribute is attrString, you need to use strncmp to compare the corresponding string fields from two records. See the man page and the sample program (qsort_example.C) for more information about qsort.

Name your comparison function tupleCmp().

After calling qsort, write the sorted run out to a temporary heap file:

Passes 2 and Above: In the merging passes, you will primarily use frames in the buffer pool, rather than additional main memory, unlike Pass 1.  You are allowed to use amt_of_buf frames in the buffer pool --- amt_of_buf - 1 buffer frames are for input buffers, and one buffer frame is for the output buffer.  The simplest way to do this is to open scans on amt_of_buf - 1 (heap files containing) runs from the previous pass. As for the output, you should create a new temporary file, and insert records into them from the input runs, ``merging'' the input runs.

The temporary files used to hold input runs should be deleted (with the deleteFile method) after the pass in which they are merged. The sorting is complete when you are left with just one file. Remember that the name of the result file should be outFile.

Handing in Your Code

Use hsp to hand in the entire directory containing your code. Do not submit the individual files, but rather submit the directory as a whole. For example, if I am sitting in the directory above proj4, I would execute the following command:

hsp musicant CS347 proj4

Good luck!