Mark Gerstein, PhD

Albert L Williams Professor of Biomedical Informatics and Professor of Molecular Biophysics & Biochemistry, of Computer Science, and of Statistics & Data Science

Appointments

Molecular Biophysics and Biochemistry

Fully Joint

Statistics

Fully Joint

Biomedical Informatics & Data Science

Secondary

Contact Info

mark.gerstein@yale.edu

203.432.6105

About

Biography

After graduating from Harvard with an A.B. in physics in 1989, Prof. Mark Gerstein earned a doctorate in theoretical chemistry and biophysics from Cambridge University in 1993. He did postdoctoral research in bioinformatics at Stanford University from 1993 to 1996. He came to Yale in 1997 as an assistant professor and in 2003 became co-director of the Yale Computational Biology and Bioinformatics Program. Gerstein has published extensively in the scientific literature, with more than 700 publications and an h-index above 200, including papers in prominent venues such as Science, Nature, Cell, and Scientific American. His research focuses on biomedical data science, and he is particularly interested in machine learning, macromolecular simulation, human genome annotation & disease genomics, and genomic privacy.

Last Updated on August 02, 2025.

Appointments

Molecular Biophysics and Biochemistry
Professor
Fully Joint

Statistics
Professor
Fully Joint

Biomedical Informatics & Data Science
Professor
Secondary

Other Departments & Organizations

Biochemistry, Quantitative Biology, Biophysics and Structural Biology (BQBS)
Biomedical Informatics & Data Science
Center for RNA Science and Medicine
Colton Center for Autoimmunity
Computational Biology and Biomedical Informatics
Computational Biology and Bioinformatics
Genomics, Genetics, and Epigenetics
High Performance Computation
Keck
Molecular Biophysics and Biochemistry
NIDA Neuroproteomics Center
Program in Neurodevelopment and Regeneration
Statistics
Yale Cancer Center
Yale Center for Genomic Health
Yale Combined Program in the Biological and Biomedical Sciences (BBS)
Yale Ventures

Education & Training

PhD: Cambridge University (1992)

Research

Overview

The Gerstein lab has been engaged in biomedical data science for the past ~25 years – before the field had a defined name. We initially focused on macromolecular structure and physical simulation due to the availability of data and a well-developed calculational formalism. While we continue to work in these areas, the excitement surrounding the human genome has led us to increasingly focus on genomics. Overall, the lab serves as a connector, bridging the vast data generation in the biomedical sciences with analytic approaches from statistics and computer science, particularly AI-driven methods. Much of our work takes place within large consortia, such as ENCODE and 1000 Genomes.

Currently, our lab conducts analyses across multiple areas:

Genome Annotation, particularly in terms of Biological Networks

Annotating the human genome is a central focus of many in biomedicine. We have contributed significantly to this effort through active participation in worldwide collaborations such as ENCODE, modENCODE, and GENCODE, as well as through the development of computational approaches for processing bulk, single-cell, multi-omic, and spatial data. A main focus has been on identifying regulatory sites (e.g., enhancers), epigenetics, and coding and non-coding RNAs. We have developed integrative methods to put these together, predicting the expression of target genes in specific cell types from their upstream control regions. Additionally, we have contributed to annotating pseudogenes (fossil genes) and determining what they tell us about the history of the human genome. See encode/annotation and pseudogene papers.

Beyond direct annotation, we develop approaches to recast the genomic information into molecular networks. In particular, we build and analyze gene-regulatory networks, protein-protein interactions, cell-to-cell communication networks, and metabolic pathways, identifying key nodes, such as information-flow bottlenecks and the apexes of regulatory hierarchies. We integrate these networks with dynamic expression data, three-dimensional protein structures, and other functional data to uncover principles governing biological systems. Given that people often have stronger intuition for social and computer networks than for biological ones, we have found cross-disciplinary comparisons useful in elucidating system-level properties—such as the relationship between connectivity and evolutionary constraint. See network papers.

Disease Genomics (Neurogenomics & Cancer Genomics)

The declining cost of next-generation sequencing has enabled researchers to rapidly investigate the genomic contributions to an individual’s disease. We have contributed to this effort through comprehensive studies and computational approaches designed to link personal genomic variants to disease. Our research has spanned a wide range of diseases, with a particular focus on cancers and brain disorders. Recent efforts in cancer include developing tools to prioritize noncoding driver mutations, examine the collective impact of nondriver mutations, and analyze full mutational spectra. See cancer genomics papers.

In neurogenomics, we have developed a comprehensive functional genomic resource for the human brain in the PsychENCODE project, integrating single-cell data with bulk functional genomics datasets. We have used this resource to identify many eQTLs (expression QTLs) in both bulk and single-cell contexts. Furthermore, the resource has allowed us to construct predictive models linking genomic variants, via chromatin activity and single-cell gene expression, to observed organismal phenotypes for schizophrenia, bipolar disorder, and Alzheimer’s disease. These models enable us to highlight key genes and pathways in these disorders, potentially identifying drug targets. See neurogenomics papers.

Using Packing to Understand Macromolecular Dynamics

While non-coding regions play an important, if underappreciated, role in genome function and disease, we also characterize coding sequences and drill deep into their protein products. By analyzing protein motions, we can better predict how a mutation affects function. Our effort involves devising a system for characterizing motions in a standardized fashion in terms of key statistics, such as the degree of rotation around hinges. It is guided by the fact that tight packing highly restricts protein mobility. See molecular motion and structure papers.

Interpretable Machine Learning Tools for Biomedical Data

Much of our work involves developing practical tools and software applications for tackling concrete biomedical problems. Often, these take the form of computational pipelines or web servers that encapsulate various statistical and machine-learning methods. A key distinguishing aspect of our tools is that they are grounded in physical principles and biological mechanisms, which enhances interpretability and keeps them aligned with established scientific knowledge. Recent examples include genomic pipelines for characterizing multi-scale “peaks” in chromatin activity data and for identifying enhancers with recurrent neural networks, visualization servers for macromolecular motions and for the gene-regulatory hierarchy, and software for identifying protein misfolding by repurposing existing embeddings from large language models. See the key tool papers and the Gerstein Lab repository on GitHub for more details.

Privacy of Genomic and Biomedical Data

Increasingly, one of the main limitations in genomic analysis is securing enough individuals for a properly powered analysis. This requires keeping many individuals’ biomedical data private. While this may seem straightforward, it is highly complex due to the high dimensionality and large scale of the data, particularly for genomes. We have developed various statistical methods to quantify the extent of private information leakage, including subtle and often overlooked risks (e.g., via linking attacks). Additionally, we have designed approaches to selectively sanitize and share data while minimizing the loss of its utility for downstream analyses. This includes secure data-sharing frameworks using homomorphic encryption and blockchain storage. See privacy papers.

Future Directions, Fusing Diverse Biomedical Data Modalities with AI Approaches

Going forward, we aim to integrate the broader range of biomedical data coming online, including image data, biosensor data, and various forms of textual data from publications and electronic health records. We see tremendous value in creatively fusing diverse data types, using the genome as an organizing platform. We have made notable progress on this recently: linking genetic variants to biosensor outputs (i.e., using smartwatch outputs in GWAS), developing ensemble machine-learning approaches for cryo-EM image processing, and developing large language models for automatic bioinformatics code generation.

References

See Papers.GersteinLab.org – in particular, Best Papers and listing of Key Contributions.

Some talks giving a quick overview of the lab: 5′ Quick Overview (’25), 5′ animation (’20), 15′ powerpoint (’19)

Medical Research Interests

Biochemistry; Biophysics; Computational Biology; Computer Simulation; Data Science; Deep Learning; DNA; Genomics; Image Processing, Computer-Assisted; Medical Informatics; Molecular Dynamics Simulation; Proteomics; Wearable Electronic Devices

ORCID
0000-0002-9746-3719

Research at a Glance

Yale Co-Authors

Frequent collaborators of Mark Gerstein's published research.

Publications

Featured Publications

Single-cell genomics and regulatory networks for 388 human brains
Emani P, Liu J, Clarke D, Jensen M, Warrell J, Gupta C, Meng R, Lee C, Xu S, Dursun C, Lou S, Chen Y, Chu Z, Galeev T, Hwang A, Li Y, Ni P, Zhou X, Bakken T, Bendl J, Bicks L, Chatterjee T, Cheng L, Cheng Y, Dai Y, Duan Z, Flaherty M, Fullard J, Gancz M, Garrido-Martín D, Gaynor-Gillett S, Grundman J, Hawken N, Henry E, Hoffman G, Huang A, Jiang Y, Jin T, Jorstad N, Kawaguchi R, Khullar S, Liu J, Liu J, Liu S, Ma S, Margolis M, Mazariegos S, Moore J, Moran J, Nguyen E, Phalke N, Pjanic M, Pratt H, Quintero D, Rajagopalan A, Riesenmy T, Shedd N, Shi M, Spector M, Terwilliger R, Travaglini K, Wamsley B, Wang G, Xia Y, Xiao S, Yang A, Zheng S, Gandal M, Lee D, Lein E, Roussos P, Sestan N, Weng Z, White K, Won H, Girgenti M, Zhang J, Wang D, Geschwind D, Gerstein M, Akbarian S, Abyzov A, Ahituv N, Arasappan D, Almagro Armenteros J, Beliveau B, Berretta S, Bharadwaj R, Bhattacharya A, Brennand K, Capauto D, Champagne F, Chatzinakos C, Chen H, Cheng L, Chess A, Chien J, Clement A, Collado-Torres L, Cooper G, Crawford G, Dai R, Daskalakis N, Davila-Velderrain J, Deep-Soboslay A, Deng C, DiPietro C, Dracheva S, Drusinsky S, Duong D, Eagles N, Edelstein J, Galani K, Girdhar K, Goes F, Greenleaf W, Guo H, Guo Q, Hadas Y, Hallmayer J, Han X, Haroutunian V, He C, Hicks S, Ho M, Ho L, Huang Y, Huuki-Myers L, Hyde T, Iatrou A, Inoue F, Jajoo A, Jiang L, Jin P, Jops C, Jourdon A, Kellis M, Kleinman J, Kleopoulos S, Kozlenkov A, Kriegstein A, Kundaje A, Kundu S, Li J, Li M, Lin X, Liu S, Liu C, Loupe J, Lu D, Ma L, Mariani J, Martinowich K, Maynard K, Myers R, Micallef C, Mikhailova T, Ming G, Mohammadi S, Monte E, Montgomery K, Mukamel E, Nairn A, Nemeroff C, Norton S, Nowakowski T, Omberg L, Page S, Park S, Patowary A, Pattni R, Pertea G, Peters M, Pinto D, Pochareddy S, Pollard K, Pollen A, Przytycki P, Purmann C, Qin Z, Qu P, Raj T, Reach S, Reimonn T, Ressler K, Ross D, Rozowsky J, Ruth M, Ruzicka W, Sanders S, Schneider J, Scuderi S, Sebra R, Seyfried N, Shao Z, Shieh A, 
Shin J, Skarica M, Snijders C, Song H, State M, Stein J, Steyert M, Subburaju S, Sudhof T, Snyder M, Tao R, Therrien K, Tsai L, Urban A, Vaccarino F, van Bakel H, Vo D, Voloudakis G, Wang T, Wang S, Wang Y, Wei Y, Weimer A, Weinberger D, Wen C, Whalen S, Willsey A, Wong W, Wu H, Wu F, Wuchty S, Wylie D, Yap C, Zeng B, Zhang P, Zhang C, Zhang B, Zhang Y, Ziffra R, Zeier Z, Zintel T. Single-cell genomics and regulatory networks for 388 human brains. Science 2024, 384: eadi5199-eadi5199. PMID: 38781369, PMCID: PMC11365579, DOI: 10.1126/science.adi5199.
Peer-Reviewed Original Research
Digital phenotyping from wearables using AI characterizes psychiatric disorders and identifies genetic associations
Liu J, Borsari B, Li Y, Liu S, Gao Y, Xin X, Lou S, Jensen M, Garrido-Martín D, Verplaetse T, Ash G, Zhang J, Girgenti M, Roberts W, Gerstein M. Digital phenotyping from wearables using AI characterizes psychiatric disorders and identifies genetic associations. Cell 2024, 188: 515-529.e15. PMID: 39706190, PMCID: PMC12278733, DOI: 10.1016/j.cell.2024.11.012.
Peer-Reviewed Original Research
The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models
Rozowsky J, Gao J, Borsari B, Yang Y, Galeev T, Gürsoy G, Epstein C, Xiong K, Xu J, Li T, Liu J, Yu K, Berthel A, Chen Z, Navarro F, Sun M, Wright J, Chang J, Cameron C, Shoresh N, Gaskell E, Drenkow J, Adrian J, Aganezov S, Aguet F, Balderrama-Gutierrez G, Banskota S, Corona G, Chee S, Chhetri S, Cortez Martins G, Danyko C, Davis C, Farid D, Farrell N, Gabdank I, Gofin Y, Gorkin D, Gu M, Hecht V, Hitz B, Issner R, Jiang Y, Kirsche M, Kong X, Lam B, Li S, Li B, Li X, Lin K, Luo R, Mackiewicz M, Meng R, Moore J, Mudge J, Nelson N, Nusbaum C, Popov I, Pratt H, Qiu Y, Ramakrishnan S, Raymond J, Salichos L, Scavelli A, Schreiber J, Sedlazeck F, See L, Sherman R, Shi X, Shi M, Sloan C, Strattan J, Tan Z, Tanaka F, Vlasova A, Wang J, Werner J, Williams B, Xu M, Yan C, Yu L, Zaleski C, Zhang J, Ardlie K, Cherry J, Mendenhall E, Noble W, Weng Z, Levine M, Dobin A, Wold B, Mortazavi A, Ren B, Gillis J, Myers R, Snyder M, Choudhary J, Milosavljevic A, Schatz M, Bernstein B, Guigó R, Gingeras T, Gerstein M. The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models. Cell 2023, 186: 1493-1511.e40. PMID: 37001506, PMCID: PMC10074325, DOI: 10.1016/j.cell.2023.02.018.
Peer-Reviewed Original Research
Data Sanitization to Reduce Private Information Leakage from Functional Genomics
Gürsoy G, Emani P, Brannon CM, Jolanki OA, Harmanci A, Strattan JS, Cherry JM, Miranker AD, Gerstein M. Data Sanitization to Reduce Private Information Leakage from Functional Genomics. Cell 2020, 183: 905-917.e16. PMID: 33186529, PMCID: PMC7672785, DOI: 10.1016/j.cell.2020.09.036.
Peer-Reviewed Original Research
Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences
Kumar S, Warrell J, Li S, McGillivray PD, Meyerson W, Salichos L, Harmanci A, Martinez-Fundichely A, Chan CWY, Nielsen MM, Lochovsky L, Zhang Y, Li X, Lou S, Pedersen JS, Herrmann C, Getz G, Khurana E, Gerstein MB. Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences. Cell 2020, 180: 915-927.e16. PMID: 32084333, PMCID: PMC7210002, DOI: 10.1016/j.cell.2020.01.032.
Peer-Reviewed Original Research
Comprehensive functional genomic resource and integrative model for the human brain
Wang D, Liu S, Warrell J, Won H, Shi X, Navarro FCP, Clarke D, Gu M, Emani P, Yang YT, Xu M, Gandal MJ, Lou S, Zhang J, Park JJ, Yan C, Rhie SK, Manakongtreecheep K, Zhou H, Nathan A, Peters M, Mattei E, Fitzgerald D, Brunetti T, Moore J, Jiang Y, Girdhar K, Hoffman GE, Kalayci S, Gümüş ZH, Crawford GE, Ashley-Koch A, Crawford G, Garrett M, Song L, Safi A, Johnson G, Wray G, Reddy T, Goes F, Zandi P, Bryois J, Jaffe A, Price A, Ivanov N, Collado-Torres L, Hyde T, Burke E, Kleiman J, Tao R, Shin J, Akbarian S, Girdhar K, Jiang Y, Kundakovic M, Brown L, Kassim B, Park R, Wiseman J, Zharovsky E, Jacobov R, Devillers O, Flatow E, Hoffman G, Lipska B, Lewis D, Haroutunian V, Hahn C, Charney A, Dracheva S, Kozlenkov A, Belmont J, DelValle D, Francoeur N, Hadjimichael E, Pinto D, van Bakel H, Roussos P, Fullard J, Bendl J, Hauberg M, Mangravite L, Peters M, Chae Y, Peng J, Niu M, Wang X, Webster M, Beach T, Chen C, Jiang Y, Dai R, Shieh A, Liu C, Grennan K, Xia Y, Vadukapuram R, Wang Y, Fitzgerald D, Cheng L, Brown M, Brown M, Brunetti T, Goodman T, Alsayed M, Gandal M, Geschwind D, Won H, Polioudakis D, Wamsley B, Yin J, Hadzic T, De La Torre Ubieta L, Swarup V, Sanders S, State M, Werling D, An J, Sheppard B, Willsey A, White K, Ray M, Giase G, Kefi A, Mattei E, Purcaro M, Weng Z, Moore J, Pratt H, Huey J, Borrman T, Sullivan P, Giusti-Rodriguez P, Kim Y, Sullivan P, Szatkiewicz J, Rhie S, Armoskus C, Camarena A, Farnham P, Spitsyna V, Witt H, Schreiner S, Evgrafov O, Knowles J, Gerstein M, Liu S, Wang D, Navarro F, Warrell J, Clarke D, Emani P, Gu M, Shi X, Xu M, Yang Y, Kitchen R, Gürsoy G, Zhang J, Carlyle B, Nairn A, Li M, Pochareddy S, Sestan N, Skarica M, Li Z, Sousa A, Santpere G, Choi J, Zhu Y, Gao T, Miller D, Cherskov A, Yang M, Amiri A, Coppola G, Mariani J, Scuderi S, Szekely A, Vaccarino F, Wu F, Weissman S, Roychowdhury T, Abyzov A, Roussos P, Akbarian S, Jaffe A, White K, Weng Z, Sestan N, Geschwind D, Knowles J, Gerstein M. 
Comprehensive functional genomic resource and integrative model for the human brain. Science 2018, 362: eaat8464. PMID: 30545857, PMCID: PMC6413328, DOI: 10.1126/science.aat8464.
Peer-Reviewed Original Research
The real cost of sequencing: scaling computation to keep pace with data generation
Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, Gerstein M. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology 2016, 17: 53. PMID: 27009100, PMCID: PMC4806511, DOI: 10.1186/s13059-016-0917-0.
Peer-Reviewed Original Research
Temporal Dynamics of Collaborative Networks in Large Scientific Consortia
Wang D, Yan KK, Rozowsky J, Pan E, Gerstein M. Temporal Dynamics of Collaborative Networks in Large Scientific Consortia. Trends In Genetics 2016, 32: 251-253. PMID: 27005445, DOI: 10.1016/j.tig.2016.02.006.
Peer-Reviewed Original Research
Comparative analysis of the transcriptome across distant species
Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, Pei B, Harmanci AO, Duff MO, Djebali S, Alexander RP, Alver BH, Auerbach R, Bell K, Bickel PJ, Boeck ME, Boley NP, Booth BW, Cherbas L, Cherbas P, Di C, Dobin A, Drenkow J, Ewing B, Fang G, Fastuca M, Feingold EA, Frankish A, Gao G, Good PJ, Guigó R, Hammonds A, Harrow J, Hoskins RA, Howald C, Hu L, Huang H, Hubbard TJ, Huynh C, Jha S, Kasper D, Kato M, Kaufman TC, Kitchen RR, Ladewig E, Lagarde J, Lai E, Leng J, Lu Z, MacCoss M, May G, McWhirter R, Merrihew G, Miller DM, Mortazavi A, Murad R, Oliver B, Olson S, Park PJ, Pazin MJ, Perrimon N, Pervouchine D, Reinke V, Reymond A, Robinson G, Samsonova A, Saunders GI, Schlesinger F, Sethi A, Slack FJ, Spencer WC, Stoiber MH, Strasbourger P, Tanzer A, Thompson OA, Wan KH, Wang G, Wang H, Watkins KL, Wen J, Wen K, Xue C, Yang L, Yip K, Zaleski C, Zhang Y, Zheng H, Brenner SE, Graveley BR, Celniker SE, Gingeras TR, Waterston R. Comparative analysis of the transcriptome across distant species. Nature 2014, 512: 445-448. PMID: 25164755, PMCID: PMC4155737, DOI: 10.1038/nature13424.
Commentaries, Editorials and Letters
Comparative analysis of pseudogenes across three phyla
Sisu C, Pei B, Leng J, Frankish A, Zhang Y, Balasubramanian S, Harte R, Wang D, Rutenberg-Schoenberg M, Clark W, Diekhans M, Rozowsky J, Hubbard T, Harrow J, Gerstein MB. Comparative analysis of pseudogenes across three phyla. Proceedings Of The National Academy Of Sciences Of The United States Of America 2014, 111: 13361-13366. PMID: 25157146, PMCID: PMC4169933, DOI: 10.1073/pnas.1407293111.
Peer-Reviewed Original Research

Get In Touch

Contacts

mark.gerstein@yale.edu

Academic Office Number

203.432.6105

Secondary Academic Office Number

203.432.8189

Locations

Bass Center
Academic Office
266 Whitney Avenue, Rm 432A
New Haven, CT 06511

Yale Co-Authors

Prashant Emani

Alexej Abyzov, PhD

Jonathan Warrell

Nenad Sestan, MD, PhD

Angus Nairn, PhD

Kristen Brennand, PhD

News

How Many LLMs Does it Take to Reason Through a Decision?

Uncovering the Hidden Cellular Connections that Bridge Aging and Disease

YSM Researchers Recognized with Yale Faculty Innovation Awards

From Lab to Launch: Eleven Faculty Innovators Recognized for Turning Research into Real-World Impact
