Advisor: Layla Oesper
Human DNA can be interpreted as a string over the four-character alphabet A, C, G and T, where each character represents one of the bases (nucleotides) in DNA. In total, a human’s genome contains over 3 billion such characters. When a person’s genome is sequenced, we don’t get a complete picture of their genetic code. Instead, we obtain many millions or billions of short, somewhat noisy strings (~200-1000 characters), called reads, each of which comes from somewhere in their genome. To make sense of this data, these short reads are aligned to an existing reference genome, which is a representation of an idealized member of the species built from the sequences of multiple individuals. Alignments of reads are not always perfect. A read may contain mismatches relative to the reference, be missing subsequences that exist in the reference, or contain extra sequences that do not exist in the reference. Given the size of the reference genome (> 3 billion characters) and the number of reads to align (millions to billions), alignment is a computationally and algorithmically challenging problem.
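To make the alignment problem concrete, here is a minimal Python sketch of the brute-force approach: slide the read along the reference and count mismatches at every offset. The function names and the toy strings are made up for illustration, and the sketch handles only substitutions, not the missing or extra subsequences described above.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_align(read: str, reference: str, max_mismatches: int = 1) -> list[tuple[int, int]]:
    """Return (position, mismatches) for every offset where the read matches
    the reference with at most max_mismatches substitutions."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        distance = hamming(read, window)
        if distance <= max_mismatches:
            hits.append((pos, distance))
    return hits


# Toy data, invented for this example (real references have billions of characters).
reference = "ACGTACGTTAGC"
read = "ACGTT"
print(naive_align(read, reference))  # [(0, 1), (4, 0)]
```

Every read is compared against every offset, so the work grows with the product of the reference length and the total read length. At genome scale that cost is what motivates the indexed approaches this project investigates.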
The Burrows-Wheeler Transform (BWT) is a reversible rearrangement of the characters of a string that tends to group identical characters together, which makes the string easier to compress (that is, to store in less space). It has been used in a number of different domains. For instance, it is a core step in the bzip2 file compression program. The BWT is particularly useful when the string to be compressed is repetitive, which is a key feature of human DNA. A number of algorithms have been proposed that build upon the BWT to efficiently align short DNA sequences to a reference genome.
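As a rough illustration, the Python sketch below computes the BWT by sorting all rotations of the string, inverts it, and counts exact occurrences of a pattern using only the transformed string (the backward-search idea that BWT-based aligners build on). The function names and the "$" end marker are assumptions chosen for this example. Sorting full rotations and rescanning the string for every count are far too slow for a real genome, so treat this as a sketch of the idea rather than of any particular aligner.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (quadratic memory; toy use only)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


def count_occurrences(pattern: str, transformed: str) -> int:
    """Count exact matches of pattern in the original text using backward search."""
    # first[c] = index of the first sorted rotation that starts with c.
    first = {}
    for i, c in enumerate(sorted(transformed)):
        first.setdefault(c, i)

    def occ(c: str, i: int) -> int:
        # Occurrences of c in transformed[:i]; real indexes precompute this.
        return transformed[:i].count(c)

    lo, hi = 0, len(transformed)  # half-open range of matching rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


genome = "ACAACG"  # toy string, invented for this example
transformed = bwt(genome)
print(transformed)                           # GC$AAAC
print(inverse_bwt(transformed))              # ACAACG
print(count_occurrences("AC", transformed))  # 2
```

Note that the search loop runs once per pattern character regardless of how long the original text is. Practical indexes keep that property by replacing the rotation sort and the repeated counting with precomputed structures.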
In this project you will investigate the Burrows-Wheeler Transform (BWT) and its application to short-read DNA alignment. In particular you will:
Previous biology experience is NOT required for this project. Other courses that could be useful for this project include algorithms and computational biology.
Below are a few papers about existing work related to the BWT in DNA sequence aligners. These are only intended to give you a starting point for your literature search. They are neither the only nor necessarily the best sources for ideas. You will be finding and reading many additional papers!
Fall Term, MWF 5A