Predicting Secondary Structure
k-Nearest Neighbors
The intuition behind the k-nearest neighbors (k-NN) algorithm is that a known object's values will be more useful for predicting an unknown object's values if those two are "close" in some sense than if they are far apart. To predict a target object, k-NN searches through a database of known objects, finds the one that is closest to the target according to some distance metric, and then extrapolates the match object's known values onto the target's unknown fields.
To adapt this algorithm to predict secondary structure, we had to make several decisions. First, what does it mean for two primary structures to be "close" to each other? And second, once we've determined that two proteins' primary structures are related in some way, how can we extrapolate one protein's secondary structure onto the other?
Answering the first question, about what distance means in terms of protein primary structures, requires a degree of chemical and biological knowledge about proteins that was beyond the scope of our project. Instead, we made use of two existing expert tools to provide us with our distance metrics: BLAST and SAM.
BLAST, the Basic Local Structure Alignment Tool, is a widely used algorithm for comparing sequence information. Given two proteins, BLAST returns a number indicating the BLAST-distance between them, as well as a series of alignments between the two that it used in its calculations. Our project used WU-BLAST, an implementation that was until recently available from Washington University.
SAM, the Sequence Alignment and Modeling System, was developed by Kevin Karplus and his team at UCSC. SAM is also able to get a distance score for two proteins and a set of alignments between them; it uses hidden Markov models to make its predictions.
After using one of these two distance metrics to find a closest match protein, our nearest neighbors algorithm extrapolates its secondary structure onto the target protein. It does this using simple scaling, then refines its results by using the alignments provided by its distance metric.
Our implementation of nearest neighbors is in secondary_predictor_kNN.py
in
the prediction
directory.