Disease Gene Prioritization for Complex Diseases Using Large-Scale Biological Networks

Advisor: Layla Oesper

Background

Many human diseases have genetic origins. A gene is a sequence of DNA that codes for a protein associated with a particular biological function (you can think of genes and proteins as having a one-to-one relationship). A gene in which a mutation, or aberration, disrupts normal function and contributes to a disease is called a disease gene. While some human diseases like cystic fibrosis or Huntington’s disease are caused by a mutation to a single gene, many other diseases such as cancer or diabetes are complex diseases and result from mutations to many genes. Identifying the genes associated with a particular disease is an essential step towards the development of treatments or diagnostic tests for these diseases.

Recent advances in DNA sequencing technologies have allowed for unprecedented analysis of the genetic origins of such complex diseases. However, the human genome has ~20,000 genes, so prioritization of candidate disease genes before experimental testing is essential. One recent approach to prioritizing such disease genes utilizes a particular type of biological data encoded in a large network or graph called a protein-protein interaction (PPI) network. In these networks each vertex represents a protein (or the gene that codes for that protein) and each edge represents a pair of proteins that have been shown to have some form of biological interaction. The idea is that the topology of a large PPI network and the location of the known disease genes in that network may be useful for identifying and ranking other potential disease genes. For example, a naive approach may be to rank genes based on how many known disease genes they have as direct neighbors.
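The naive neighbor-counting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the network is given as a list of undirected edges, the function names build_adjacency and rank_by_disease_neighbors are hypothetical, and the gene names in the toy network are made up rather than taken from any real PPI dataset.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def rank_by_disease_neighbors(adjacency, known_disease_genes):
    """Rank candidate genes by how many known disease genes are direct neighbors."""
    known = set(known_disease_genes)
    scores = {
        gene: len(neighbors & known)
        for gene, neighbors in adjacency.items()
        if gene not in known  # only rank genes not already known
    }
    # Highest count first; ties are broken alphabetically so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy network with made-up gene names (assumption: not real interaction data).
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5")]
toy_ranking = rank_by_disease_neighbors(build_adjacency(toy_edges), {"G1", "G4"})
print(toy_ranking)  # [('G3', 2), ('G2', 1), ('G5', 1)]
```

Here G3 ranks first because it is adjacent to both known disease genes. Note that this score favors high-degree proteins simply because they have more neighbors, which is one way network bias can leak into a ranking.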

(Left) Yeast Protein-Protein Interaction Network, (Right) Human Protein-Protein Interaction Network.

The project

In this project you will investigate graph-based methods for disease gene prioritization using large-scale protein-protein interaction networks. In particular you will:

Investigate different PPI networks and disease gene sets that are available and analyze the different properties of each.
Study existing literature on graph-based algorithms for disease gene prioritization.
Understand and implement several different graph-based methods for disease gene prioritization given a large network and a set of known disease genes.
Evaluate the performance of the methods you implement. You will need to think carefully about how you will do this evaluation since the whole point of these methods is to infer something that is unknown.
Consider how biases in the underlying PPI networks affect the output from your methods.
(If time) Consider extensions to the basic disease gene prioritization problem, such as adding an importance score to each known disease gene.
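As one possible starting point for the implementation and evaluation steps above, here is a minimal sketch of a random walk with restart (the general technique used in the Köhler et al. reference) together with a leave-one-out evaluation. Everything specific here is an assumption for illustration: the function names, the restart probability of 0.3, the convergence tolerance, the toy network, and the use of per-gene seed weights as a stand-in for an importance score. It also assumes every known disease gene appears in the network and every vertex has at least one neighbor.

```python
def random_walk_with_restart(adjacency, seed_weights, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until it stabilizes.

    W spreads each node's probability evenly over its neighbors, and p0 is
    the seed distribution (seed weights normalized to sum to 1).
    """
    seeds = {g: w for g, w in seed_weights.items() if g in adjacency}
    total = sum(seeds.values())
    if total <= 0:
        raise ValueError("at least one seed gene with positive weight is required")
    p0 = {node: seeds.get(node, 0.0) / total for node in adjacency}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {node: restart * p0[node] for node in adjacency}
        for node, neighbors in adjacency.items():
            if not neighbors:
                continue  # isolated node: its mass would be lost (assumed absent)
            share = (1.0 - restart) * p[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        change = sum(abs(nxt[node] - p[node]) for node in adjacency)
        p = nxt
        if change < tol:
            break
    return p


def leave_one_out_ranks(adjacency, seed_weights):
    """Hide each known gene in turn and record where it lands in the ranking."""
    ranks = {}
    for held_out in seed_weights:
        training = {g: w for g, w in seed_weights.items() if g != held_out}
        scores = random_walk_with_restart(adjacency, training)
        # Candidates are all genes not used as seeds, including the hidden one.
        candidates = [g for g in adjacency if g not in training]
        ordered = sorted(candidates, key=lambda g: (-scores[g], g))
        ranks[held_out] = ordered.index(held_out) + 1  # rank 1 is best
    return ranks


# Toy network with made-up gene names and weights (not real data).
toy_adjacency = {
    "G1": {"G2", "G3"},
    "G2": {"G1", "G3"},
    "G3": {"G1", "G2", "G4"},
    "G4": {"G3", "G5"},
    "G5": {"G4"},
}
toy_seed_weights = {"G1": 1.0, "G4": 2.0}  # larger weight = more important seed
toy_scores = random_walk_with_restart(toy_adjacency, toy_seed_weights)
toy_ranks = leave_one_out_ranks(toy_adjacency, toy_seed_weights)
print(toy_scores)
print(toy_ranks)
```

A low rank for a held-out gene suggests the method can recover known disease genes from the rest, which is one way to evaluate a method when the true answer is unknown. On real networks a sparse matrix formulation would likely be needed for speed; the dictionary version above is only meant to make the update rule easy to read.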

Recommended experience

Enthusiasm for working on biologically relevant problems is encouraged, but previous biology experience is NOT required for this project. Experience working with large datasets will be useful, but not required. Other courses that could be useful for this project include algorithms, linear algebra, AI, data mining, and computational biology.

References/inspiration

Below are a few papers about existing work in disease gene prioritization. These are only intended to give you a starting point for your literature search; they are certainly not the only sources for ideas, nor necessarily the best. You will be finding and reading many additional papers!

Köhler, Sebastian, et al. "Walking the interactome for prioritization of candidate disease genes." The American Journal of Human Genetics 82.4 (2008): 949-958.
Karni, Shaul, Hermona Soreq, and Roded Sharan. "A network-based method for predicting disease-causing genes." Journal of Computational Biology 16.2 (2009): 181-189.
Yin, Tianshu, et al. "GenePANDA—a novel network-based gene prioritizing tool for complex diseases." Scientific Reports 7 (2017): 43258.

Meeting Times

Tuesday/Thursday 1:15pm - 2:15pm for Fall and Winter