Mark Gerstein, PhD
Research & Publications
We do research in biomedical data science, applying computational approaches to problems in molecular biology and genetics. We are interested in large-scale analyses of genome sequences and macromolecular structures. We also work on analyzing images and large-scale text and bio-sensor data. Our research involves several quantitative techniques, including database design, systematic data mining and deep learning, visualization of high-dimensional data, and molecular simulation. We specifically focus on annotating the human genome sequence, especially in characterizing the vast intergenic regions and interpreting disease-associated variants. Doing this at scale requires tackling issues of genomic privacy (to enable data sharing) and better representing the disease phenotypes associated with the variants. Next, we are trying to get at the function of all the genes encoded by the genome using molecular networks. Finally, for the group of protein-coding genes with known 3D structures, we are trying to see how their function is carried out through motion.
Specialized Terms: Biochemistry; Bioinformatics; Biophysics; Computational Biology; DNA; Genomics; Molecular Simulation; Proteins; Sequence Alignment; Structural Biology
Extensive Research Description
Soon, sequencing one’s genome may become as commonplace as getting an X-ray. Consequently, personal genomes will increasingly serve as the lens through which the public views biology. Appropriately interpreting personal genomes – particularly in relation to disorders such as cancer and neurological diseases – is therefore of key importance. Moreover, the expansion of DNA sequencing and other data-generating technologies related to images, structures and sensors is making biomedical data science a growing area with broad connections to other data-intensive disciplines under the umbrella of data science. Significant advances in computing have paralleled the ongoing rapid growth in biomedical data generation, giving rise to new approaches in machine learning, network science, and physical modeling. The Gerstein lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and statistics to address practical questions and large-scale data in molecular biology. Often, we carry out our work in multi-disciplinary teams through collaborative efforts (e.g. in consortia such as PsychENCODE, ENCODE, GSP/CMG and 1000 Genomes). Below, we describe specific aspects of our research that work toward our overall goal of interpreting personal genomes and advancing biomedical data science as a field.
At the heart of our lab is human genome annotation. We have made significant efforts to annotate the human genome, through active participation in worldwide collaborations including ENCODE, modENCODE, and GENCODE, as well as through the development of computational tools. Our work targets coding and noncoding genomic regions and ranks somatic and germline variants in relation to their potential functional impact and deleteriousness in causing disease.
Our main focus has been on transcription factor binding sites and non-coding RNAs (ncRNAs). We have developed tools that leverage comprehensive ChIP sequencing (ChIP-seq) data to detect transcription factor binding sites and use this information to predict the expression of target genes. Other tools that we built identify ncRNAs and regions of intragenic transcription by processing datasets from RNA-seq and ChIP-seq assays. Additionally, we have contributed to the annotation of pseudogenes. See encode/annotation and pseudogene papers.
Disease Genomics (Neurogenomics & Cancer Genomics)
The declining cost of next-generation sequencing has allowed researchers to rapidly study the genomic contributions to disease. In the Gerstein Lab, we have contributed to this effort through comprehensive studies and computational tools that aim to establish connections between genomic variants and disease. We have studied a considerable number of diseases with a focus on cancers and brain disorders. Recent efforts include developing tools to prioritize noncoding driver mutations in cancer, integrating genome annotations with cancer genomes to develop a resource for cancer genomics, and studying the effects of nondriver mutations in cancer. In tandem, we co-led an effort to establish a comprehensive functional genomic resource that pertains to the human brain, which involved integrating data at the single-cell level with plentiful bulk functional genomics data. See neurogenomics and cancer genomics papers.
Personal Genomics & Genomic Privacy
For specific personal genomes, we have developed various “callers” to find variants. Comparing variant calls between individuals shows that all humans share the vast majority of their genomes, yet a small fraction of each individual’s genome sequence shapes her or his unique combination of physical and physiological traits. We have developed tools that study personal genomics and link molecular phenotypes such as gene expression to differences in parental alleles.
Overall, this work highlights the potential for high-dimensional genomic data to reveal sensitive personal information such as disease states. Using information theory and other approaches, we have developed tools to assess the feasibility of sharing molecular data without jeopardizing the privacy of sample donors. See personal genomics and privacy papers.
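As a rough illustration of the information-theoretic view described above, the sketch below estimates how many bits of identifying information a set of genotypes carries, using the surprisal (negative log2 probability) of each genotype. It is not the lab's tool. The function names, the Hardy-Weinberg assumption, the independence of variants, and the allele frequencies are all assumptions made for the example.

```python
import math

def genotype_probability(alt_allele_freq, alt_count):
    """Hardy-Weinberg probability of carrying alt_count (0, 1 or 2) alternate alleles."""
    p = alt_allele_freq
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[alt_count]

def leaked_bits(genotypes):
    """Total surprisal in bits, assuming the variants are independent.

    genotypes: iterable of (alt_allele_freq, alt_count) pairs.
    """
    return sum(-math.log2(genotype_probability(f, g)) for f, g in genotypes)

# Hypothetical individual: two common variants and one rare heterozygous variant.
observed = [(0.5, 1), (0.5, 0), (0.01, 1)]
bits = leaked_bits(observed)
print(f"approximate identifying information: {bits:.2f} bits")
```

Rare genotypes dominate the total, which is why a handful of uncommon variants can be far more identifying than many common ones. For scale, singling out one person among roughly 8 billion takes about 33 bits.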
Data Science and Biological Networks
We have developed tools to build and analyze multi-omics and regulatory networks, protein-protein interactions, and metabolic pathways, identifying key nodes such as hubs and bottlenecks. We have also integrated networks with dynamic gene-expression data (identifying transient hubs), three-dimensional protein structures, and other regulatory data to find large-scale regulatory principles for biological systems. Finally, people have better intuition for commonplace networks – such as those in social and computer systems – than for biological networks. Thus, we have found that cross-disciplinary comparisons help to elucidate system-level properties of biological networks, such as the association of greater connectivity with more evolutionary constraints. See data science, network, and bioinformatics tools papers.
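To make the notions of hubs and bottlenecks concrete, here is a minimal sketch on a toy network. It assumes the common convention that hubs are nodes of high degree and bottlenecks are nodes of high betweenness centrality. The graph, node names, and function names are hypothetical, and betweenness is computed with Brandes' algorithm for unweighted, undirected graphs rather than with any of the lab's tools.

```python
from collections import deque

def degree(graph):
    """Number of neighbors of each node (high values suggest hubs)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted, undirected)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                        # nodes in order of non-decreasing distance
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        n_paths = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                      # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in graph}
        while order:                      # accumulate from the farthest nodes back
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair of endpoints was counted twice in an undirected graph.
    return {v: s / 2.0 for v, s in score.items()}

# Toy network: two triangles (A-B-C and D-E-F) joined through node X.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "X"},
    "X": {"C", "D"},
    "D": {"X", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
deg = degree(graph)
bc = betweenness(graph)
print(max(deg, key=deg.get), max(bc, key=bc.get))
```

In this toy graph, X has only two neighbors yet lies on every shortest path between the two triangles, so a node can be a bottleneck without being a hub.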
Macromolecular Motion & Dynamics
While non-coding regions play an important, if underappreciated, role in genome function and disease, we also characterize coding sequences and drill deep into their protein products. By analyzing protein motions, we can better predict how a mutation affects function. This effort involves devising a system for characterizing motions in a standardized fashion in terms of key statistics, such as the degree of rotation around hinges. Our approach is guided by the fact that protein mobility is highly restricted by tight packing. We have developed a variety of tools to analyze protein structures and motions, including tools that measure packing efficiency using specialized geometric constructions (e.g., Voronoi polyhedra). Recently, we applied a combination of molecular motion simulations and network analyses to identify cancer mutation hotspots within proteins. See molecular motion and structure papers.
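One of the statistics mentioned above, the degree of rotation around a hinge, can be illustrated with a short sketch. It assumes that a 3x3 rotation matrix relating a domain's orientation in two conformations has already been obtained by structural superposition (not shown), and recovers the rotation angle from the matrix trace. The function names and the 30-degree example are hypothetical.

```python
import math

def rotation_angle_degrees(rotation):
    """Rotation angle of a 3x3 rotation matrix, from trace(R) = 1 + 2*cos(theta)."""
    trace = rotation[0][0] + rotation[1][1] + rotation[2][2]
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def rotation_about_z(angle_degrees):
    """Helper for the example: rotation by angle_degrees about the z axis."""
    a = math.radians(angle_degrees)
    return [
        [math.cos(a), -math.sin(a), 0.0],
        [math.sin(a), math.cos(a), 0.0],
        [0.0, 0.0, 1.0],
    ]

# Hypothetical hinge motion: one domain rotates 30 degrees relative to the other.
angle = rotation_angle_degrees(rotation_about_z(30.0))
print(f"hinge rotation: {angle:.1f} degrees")
```

The trace formula yields only the magnitude of the rotation (0 to 180 degrees). A fuller characterization would also need the rotation axis and the location of the hinge.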
Interpretable Machine Learning Tools
The rapid increase in biomedical data during the last two decades has also created a need for artificial intelligence tools that can find patterns embedded in large-scale datasets and work with a variety of data representations. In machine learning – a branch of artificial intelligence that integrates algorithmic and statistical techniques – we have developed tools to perform predictive tasks that provide insights into genomic research. We focus on approaches that are “interpretable” in that they have a clear biological or physical basis. Our tools process large-scale datasets to, for instance, prioritize genomic variants with respect to their biological function or potential contribution to disease, or predict protein binding. See the Gerstein Lab repository on GitHub for more details.
See Papers.GersteinLab.org — in particular, Best Papers and listing of Key Contributions.
Some talks giving a quick overview of the lab: 5′ animation (’20), 15′ powerpoint (’19)
More information on research interests can also be found here.
Biochemistry; Biophysics; Computer Simulation; DNA; Image Processing, Computer-Assisted; Medical Informatics; Computational Biology; Genomics; Proteomics; Molecular Dynamics Simulation; Wearable Electronic Devices; Deep Learning; Data Science
- Gürsoy G, Emani P, Brannon CM, Jolanki OA, Harmanci A, Strattan JS, Cherry JM, Miranker AD, Gerstein M. Data Sanitization to Reduce Private Information Leakage from Functional Genomics. Cell 2020, 183: 905-917.e16. PMID: 33186529, PMCID: PMC7672785, DOI: 10.1016/j.cell.2020.09.036.
- Kumar S, Warrell J, Li S, McGillivray PD, Meyerson W, Salichos L, Harmanci A, Martinez-Fundichely A, Chan CWY, Nielsen MM, Lochovsky L, Zhang Y, Li X, Lou S, Pedersen JS, Herrmann C, Getz G, Khurana E, Gerstein MB. Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences. Cell 2020, 180: 915-927.e16. PMID: 32084333, PMCID: PMC7210002, DOI: 10.1016/j.cell.2020.01.032.
- Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, Gerstein M. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology 2016, 17: 53. PMID: 27009100, PMCID: PMC4806511, DOI: 10.1186/s13059-016-0917-0.
- Wang D, Yan KK, Rozowsky J, Pan E, Gerstein M. Temporal Dynamics of Collaborative Networks in Large Scientific Consortia. Trends in Genetics 2016, 32: 251-253. PMID: 27005445, DOI: 10.1016/j.tig.2016.02.006.
- Harmanci A, Gerstein M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nature Methods 2016, 13: 251-256. PMID: 26828419, PMCID: PMC4834871, DOI: 10.1038/nmeth.3746.
- Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, Pei B, Harmanci AO, Duff MO, Djebali S, Alexander RP, Alver BH, Auerbach R, Bell K, Bickel PJ, Boeck ME, Boley NP, Booth BW, Cherbas L, Cherbas P, Di C, Dobin A, Drenkow J, Ewing B, Fang G, Fastuca M, Feingold EA, Frankish A, Gao G, Good PJ, Guigó R, Hammonds A, Harrow J, Hoskins RA, Howald C, Hu L, Huang H, Hubbard TJ, Huynh C, Jha S, Kasper D, Kato M, Kaufman TC, Kitchen RR, Ladewig E, Lagarde J, Lai E, Leng J, Lu Z, MacCoss M, May G, McWhirter R, Merrihew G, Miller DM, Mortazavi A, Murad R, Oliver B, Olson S, Park PJ, Pazin MJ, Perrimon N, Pervouchine D, Reinke V, Reymond A, Robinson G, Samsonova A, Saunders GI, Schlesinger F, Sethi A, Slack FJ, Spencer WC, Stoiber MH, Strasbourger P, Tanzer A, Thompson OA, Wan KH, Wang G, Wang H, Watkins KL, Wen J, Wen K, Xue C, Yang L, Yip K, Zaleski C, Zhang Y, Zheng H, Brenner SE, Graveley BR, Celniker SE, Gingeras TR, Waterston R. Comparative analysis of the transcriptome across distant species. Nature 2014, 512: 445-448. PMID: 25164755, PMCID: PMC4155737, DOI: 10.1038/nature13424.
- Sisu C, Pei B, Leng J, Frankish A, Zhang Y, Balasubramanian S, Harte R, Wang D, Rutenberg-Schoenberg M, Clark W, Diekhans M, Rozowsky J, Hubbard T, Harrow J, Gerstein MB. Comparative analysis of pseudogenes across three phyla. Proceedings of the National Academy of Sciences of the United States of America 2014, 111: 13361-13366. PMID: 25157146, PMCID: PMC4169933, DOI: 10.1073/pnas.1407293111.
- Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A, Das J, Abyzov A, Balasubramanian S, Beal K, Chakravarty D, Challis D, Chen Y, Clarke D, Clarke L, Cunningham F, Evani US, Flicek P, Fragoza R, Garrison E, Gibbs R, Gümüş ZH, Herrero J, Kitabayashi N, Kong Y, Lage K, Liluashvili V, Lipkin SM, MacArthur DG, Marth G, Muzny D, Pers TH, Ritchie GRS, Rosenfeld JA, Sisu C, Wei X, Wilson M, Xue Y, Yu F, 1000 Genomes Project Consortium, Dermitzakis ET, Yu H, Rubin MA, Tyler-Smith C, Gerstein M. Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics. Science 2013, 342: 1235587. PMID: 24092746, PMCID: PMC3947637, DOI: 10.1126/science.1235587.
- Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, Min R, Alves P, Abyzov A, Addleman N, Bhardwaj N, Boyle AP, Cayting P, Charos A, Chen DZ, Cheng Y, Clarke D, Eastman C, Euskirchen G, Frietze S, Fu Y, Gertz J, Grubert F, Harmanci A, Jain P, Kasowski M, Lacroute P, Leng J, Lian J, Monahan H, O’Geen H, Ouyang Z, Partridge EC, Patacsil D, Pauli F, Raha D, Ramirez L, Reddy TE, Reed B, Shi M, Slifer T, Wang J, Wu L, Yang X, Yip KY, Zilberman-Schapira G, Batzoglou S, Sidow A, Farnham PJ, Myers RM, Weissman SM, Snyder M. Architecture of the human regulatory network derived from ENCODE data. Nature 2012, 489: 91-100. PMID: 22955619, PMCID: PMC4154057, DOI: 10.1038/nature11245.
- Greenbaum D, Sboner A, Mu XJ, Gerstein M. Genomics and Privacy: Implications of the New Reality of Closed Data for the Field. PLOS Computational Biology 2011, 7: e1002278. PMID: 22144881, PMCID: PMC3228779, DOI: 10.1371/journal.pcbi.1002278.
- Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, Alves P, Chateigner A, Perry M, Morris M, Auerbach RK, Feng X, Leng J, Vielle A, Niu W, Rhrissorrakrai K, Agarwal A, Alexander RP, Barber G, Brdlik CM, Brennan J, Brouillet JJ, Carr A, Cheung MS, Clawson H, Contrino S, Dannenberg LO, Dernburg AF, Desai A, Dick L, Dosé AC, Du J, Egelhofer T, Ercan S, Euskirchen G, Ewing B, Feingold EA, Gassmann R, Good PJ, Green P, Gullier F, Gutwein M, Guyer MS, Habegger L, Han T, Henikoff JG, Henz SR, Hinrichs A, Holster H, Hyman T, Iniguez AL, Janette J, Jensen M, Kato M, Kent WJ, Kephart E, Khivansara V, Khurana E, Kim JK, Kolasinska-Zwierz P, Lai EC, Latorre I, Leahey A, Lewis S, Lloyd P, Lochovsky L, Lowdon RF, Lubling Y, Lyne R, MacCoss M, Mackowiak SD, Mangone M, McKay S, Mecenas D, Merrihew G, Miller DM, Muroyama A, Murray JI, Ooi SL, Pham H, Phippen T, Preston EA, Rajewsky N, Rätsch G, Rosenbaum H, Rozowsky J, Rutherford K, Ruzanov P, Sarov M, Sasidharan R, Sboner A, Scheid P, Segal E, Shin H, Shou C, Slack FJ, Slightam C, Smith R, Spencer WC, Stinson EO, Taing S, Takasaki T, Vafeados D, Voronina K, Wang G, Washington NL, Whittle CM, Wu B, Yan KK, Zeller G, Zha Z, Zhong M, Zhou X, modENCODE Consortium, Ahringer J, Strome S, Gunsalus KC, Micklem G, Liu XS, Reinke V, Kim SK, Hillier LW, Henikoff S, Piano F, Snyder M, Stein L, Lieb JD, Waterston RH. Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project. Science 2010, 330: 1775-1787. PMID: 21177976, PMCID: PMC3142569, DOI: 10.1126/science.1196914.
- Yan KK, Fang G, Bhardwaj N, Alexander RP, Gerstein M. Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. Proceedings of the National Academy of Sciences of the United States of America 2010, 107: 9186-9191. PMID: 20439753, PMCID: PMC2889091, DOI: 10.1073/pnas.0914771107.
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotechnology 2009, 27: 66-75. PMID: 19122651, PMCID: PMC2924752, DOI: 10.1038/nbt.1518.
- Gerstein M, Seringhaus M, Fields S. Structured digital abstract makes text mining easy. Nature 2007, 447: 142-142. PMID: 17495904, DOI: 10.1038/447142a.
- Kim PM, Lu LJ, Xia Y, Gerstein MB. Relating Three-Dimensional Structures to Protein Networks Provides Evolutionary Insights. Science 2006, 314: 1938-1941. PMID: 17185604, DOI: 10.1126/science.1136174.
- Yu H, Gerstein M. Genomic analysis of the hierarchical structure of regulatory networks. Proceedings of the National Academy of Sciences of the United States of America 2006, 103: 14724-14731. PMID: 17003135, PMCID: PMC1595419, DOI: 10.1073/pnas.0508637103.
- Gerstein M, Zheng D. The Real Life of Pseudogenes. Scientific American 2006, 295: 48-55. PMID: 16866288, DOI: 10.1038/scientificamerican0806-48.
- Voss NR, Gerstein M. Calculation of Standard Atomic Volumes for RNA and Comparison with Proteins: RNA is Packed More Tightly. Journal of Molecular Biology 2005, 346: 477-492. PMID: 15670598, DOI: 10.1016/j.jmb.2004.11.072.
- Luscombe NM, Madan Babu M, Yu H, Snyder M, Teichmann SA, Gerstein M. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431: 308-312. PMID: 15372033, DOI: 10.1038/nature02782.
- Echols N, Milburn D, Gerstein M. MolMovDB: analysis and visualization of conformational change and structural flexibility. Nucleic Acids Research 2003, 31: 478-482. PMID: 12520056, PMCID: PMC165551, DOI: 10.1093/nar/gkg104.