Carleton Comps Project: Protein Folding

Project: Protein Folding

Advisor: Dave Musicant

Meeting time: TTh 3:10-4:55

Final Results

I. Background

Bioinformatics is an area of study that applies computing techniques to biological problems. One particularly interesting bioinformatics problem is protein folding prediction. Proteins are made of chains of amino acids, and the particular sequence of amino acids on a chain determines the shape into which the protein folds. The shape of a protein is important for understanding how it acts in biological systems; some shapes bind better than others to other substances. If one could predict directly what shape a chain of amino acids would fold into, one could look for cures for diseases with less need for experimentation.

The makeup of a protein is described by its so-called "primary" structure, which is the ordered list of amino acids that chain together to make up that protein. Researchers are interested in using the primary structure to predict "secondary" structure, which describes the protein's shape at a fairly high level, and "tertiary" structure, which is a detailed three-dimensional description of the shape of the protein. Machine learning techniques, among others, have been used to learn how to make these predictions.
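To make the idea of learning from primary structure concrete, here is a minimal sketch of one common way to turn a sequence into numeric input for a machine learning algorithm: a sliding window of residues around the position being predicted, with each residue encoded as a one-hot vector. The function names, the padding symbol, and the window size of 13 are illustrative assumptions, not requirements of the project.

```python
from typing import List

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue: str) -> List[int]:
    """Encode one residue as 21 values: 20 amino acids plus one
    extra slot for padding or any unrecognized symbol."""
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec


def window_features(sequence: str, center: int, window: int = 13) -> List[int]:
    """Concatenate one-hot vectors for the residues in a window
    centered on `center`. Positions that fall off either end of
    the chain are encoded with the padding slot.
    The default window size is an arbitrary assumption."""
    half = window // 2
    features: List[int] = []
    for pos in range(center - half, center + half + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features
```

Calling window_features once per position in a chain yields one fixed-length feature vector per residue, which could then be paired with that residue's known secondary structure label as a training example.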

II. The Project

For this project, you will implement a variety of techniques to solve the protein folding prediction problem, and compare and contrast your approaches. You will build an environment that lets you manipulate protein data, make your predictions, and summarize your results. Graphical visualization of three-dimensional protein shapes is another direction you might take; this would let users of your system see the difference between an actual protein structure and the structure that you predict.
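As one possible way to summarize results when comparing approaches, the sketch below computes per-residue accuracy for secondary structure prediction. It assumes, purely for illustration, that predicted and actual structures are given as strings with one label per residue (for example H for helix, E for strand, C for coil); the function name is hypothetical.

```python
def per_residue_accuracy(predicted: str, actual: str) -> float:
    """Return the fraction of residues whose predicted secondary
    structure label matches the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

Running the same measure over a common set of test proteins for each technique would give one simple basis for comparing them.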

Here is a list of the concepts and technologies that will be necessary.

Machine learning algorithms. I know that neural networks have been successfully used for this problem. I'm sure that other techniques have been or could be used as well.
Data manipulation. Protein descriptions are distributed in a standardized format via the Protein Data Bank; you'll need to parse this format in order to manipulate the data appropriately.
3-D rendering. If you choose to go in this direction, you will be taking a three-dimensional description of a protein and figuring out how to display it on the screen.
Distributed computing. The Rosetta@home project attempts to do protein folding prediction by distributing the effort over a large number of computers. You may choose to try something similar.
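As a starting point for the data manipulation item above, here is a minimal sketch of reading atom coordinates from the classic fixed-column PDB text format and recovering a chain's primary structure. The class and function names are hypothetical, and the sketch deliberately ignores complications such as alternate atom locations, multiple models in one file, and non-standard residues, all of which a real parser would need to handle.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


@dataclass
class Atom:
    name: str            # atom name, e.g. "CA" for an alpha carbon
    residue: str         # three-letter residue name, e.g. "MET"
    chain: str           # single-character chain identifier
    residue_number: int  # residue sequence number within the chain
    x: float
    y: float
    z: float


def parse_atoms(lines: Iterable[str]) -> List[Atom]:
    """Extract ATOM records. PDB fields sit at fixed character
    columns, so slicing is used instead of splitting on spaces."""
    atoms = []
    for line in lines:
        if not line.startswith("ATOM  "):
            continue  # skip HETATM, REMARK, and all other records
        atoms.append(Atom(
            name=line[12:16].strip(),
            residue=line[17:20].strip(),
            chain=line[21],
            residue_number=int(line[22:26]),
            x=float(line[30:38]),
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms


def primary_structure(atoms: Iterable[Atom], chain: str) -> str:
    """Build a one-letter sequence for one chain by taking one
    alpha carbon per residue. Unknown residues become "X"."""
    return "".join(
        THREE_TO_ONE.get(atom.residue, "X")
        for atom in atoms
        if atom.chain == chain and atom.name == "CA"
    )
```

Given an open PDB file handle, parse_atoms can be called on it directly, since iterating over a file yields its lines. The resulting coordinates would also be the natural input to a 3-D rendering component.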

III. References

Note that the most useful references that you'll find will be published journal articles, most of which are not available on the web. Your group might want to meet before the summer to plan out what materials you'd like to obtain from the library before you leave if you wish to do some reading. Carleton has a very good electronic system for obtaining PDF copies of articles from journals which we do not own; you can get these via the "InterLibrary Loan" link at the library "Find" page.

Rosetta@home, Rosetta Commons. Both of these websites contain links to large numbers of publications.

Rost, B. & Sander, C. (1993). Prediction of Protein Secondary Structure at Better than 70% Accuracy.

Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB)

Worldwide Protein Data Bank wwPDB

Wikipedia article on protein structure