Mark B Gerstein PhD
Albert L. Williams Professor of Biomedical Informatics; Co-Director, Yale Computational Biology and Bioinformatics Program
Research Interests
Biochemistry; Bioinformatics; Biophysics; Computational Biology; DNA; Genomics; Molecular Simulation; Proteins; Sequence Alignment; Structural Biology
Research Summary
We do research in bioinformatics, applying computational approaches to problems in molecular biology. Broadly, we are interested in large-scale analyses of genome sequences, macromolecular structures, and functional-genomics datasets. We hope these analyses will allow us to address overall statistical questions about macromolecules, relating to their physical properties, cellular function, interactions, and phylogenetic distribution. We are especially focused on the human genome and proteome. Our research involves a number of quantitative techniques, including database design, systematic data mining and machine learning, visualization of high-dimensional data, and molecular simulation. More specifically, we focus on three questions. First, we are interested in annotating the raw human genome sequence, especially in characterizing the vast intergenic regions and one of their most important elements, pseudogenes. Next, we are trying to determine the function of all the protein elements encoded by the genome. Here, we try to characterize function on a large scale through the use of molecular networks. Finally, for the population of proteins that have known 3D structures, we are trying to see how their function is carried out through motion and how motion can be predicted from packing geometry.
Extensive Research Description
The biological sciences are being transformed by the advent of large-scale data. The sequencing of the human genome is a most dramatic example of this. Alongside this increase in biological data, computers and computation have transformed the way information is handled, stored, and mined. These computational advances, of course, apply to many facets of life. The goal of my lab is to connect these two developments, harnessing computational advances for the analysis of large-scale data, principally by carrying out integrative surveys, systematic data mining, and molecular simulation.
Specifically, we are focused on protein bioinformatics: understanding the structure, function, and evolution of proteins through analyzing populations of them in the databases and in whole-genome experiments. Overall we have four research foci, summarized below.
1. Genomics: Mining Intergenic Regions, especially in relation to Pseudogenes
We are involved in a number of large-scale collaborations to probe the activity of intergenic regions with tiling-array technology. The overall conclusion from this work has been that much of the intergenic portion of the human genome appears to be active, both transcriptionally and in terms of protein binding. In connection with tiling-array experiments, we have done extensive intergenic annotation, with a particular focus on mining intergenic regions for pseudogenes (protein fossils). Collectively, our studies enable us to determine the common "pseudofamilies" in various genomes and address important evolutionary questions about the proteins that were present in the past history of an organism.
2. Proteomics: Using Networks to Understand Protein Function
After the main elements of the human genome are identified, one needs to characterize their function. We are trying to characterize gene function through molecular networks. We work on systematically integrating many weak functional-genomic features with data mining techniques to predict protein networks (comprising protein interactions and other functional linkages). In addition, we have studied the structure of protein networks, both on a large scale in terms of global statistics (e.g., the diameter) and on a small scale in terms of local network motifs (e.g., hubs).
3. Structural Genomics: Analysis of Folds, Families and Functions on a Large Scale
Another area of research in our lab is structural genomics. Here, we conceptualize proteins not purely as character sequences or abstract network nodes, but more in terms of their molecular structure. We have examined the large-scale relationships between sequence, structure and function in order to understand the extent to which structural and functional annotation can reliably be transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. We have also related the occurrence of protein folds and families to phylogeny and deep evolutionary history.
4. Computational Biophysics: Relating Motions and Packing
The final area of focus in the lab is analyzing small populations of structures in terms of their detailed 3D geometry and physical properties. Here, we try to interpret macromolecular motions in terms of packing. We have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains.
Selected Publications
- Yip KY, Kim PM, McDermott D, Gerstein M. BMC Bioinformatics. 2009 Aug 5;10(1):241. [Epub ahead of print]
- Gerstein, M. and Zheng, D. (2006). The real life of pseudogenes. Sci. Am. 295: 48-55.
- Kim, P.M., Lu, L.J., Xia, Y., and Gerstein, M.B. (2006). Relating three-dimensional structures to protein networks provides evolutionary insights. Science 314:1938-41.
- H. Yu, M. Gerstein (2006). Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci U S A 103: 14724-31.
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein M (2009). PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 27:66-75.


