Xiting Yan, PhD

Associate Professor of Medicine (Pulmonary, Critical Care and Sleep Medicine)

Appointments

Pulmonary, Critical Care & Sleep Medicine

Primary

Biostatistics

Secondary

Additional Titles

Director of Data Analysis and Bioinformatics Hub, The Center for Precision Pulmonary Medicine (P2MED)

Assistant Professor, Biostatistics

Contact Info

xiting.yan@yale.edu

203.785.5567

About

Biography

Dr. Yan received doctoral degrees in both applied statistics and computational biology and bioinformatics. She is interested in genetics, genomics, computational biology, biostatistics, systems biology and bioinformatics. Her current research topics include (1) understanding disease heterogeneity and pathogenesis using large-scale omics data at both bulk and single-cell resolution and (2) developing novel statistical and computational methods for analyzing different types of omics data and integrating them with drug perturbation data for potential personalized treatment design.

Appointments

Pulmonary, Critical Care & Sleep Medicine: Associate Professor on Term (Primary)
Biostatistics: Assistant Professor (Secondary)

Other Departments & Organizations

Biostatistics
Center for Biomedical Data Science
Internal Medicine
Kaminski Lab
Pulmonary, Critical Care & Sleep Medicine
The Center for Precision Pulmonary Medicine (P2MED)
Yale School of Public Health

Education & Training

Postdoctoral Associate: Yale School of Medicine (2010)

PhD: University of Southern California, Biological Science Department/Computational Biology and Bioinformatics (2009)

PhD: Peking University, Department of Probability and Statistics, School of Mathematical Sciences/Applied Statistics (2006)

BS: Peking University, Department of Probability and Statistics, School of Mathematical Sciences/Probability and Statistics (2001)

Research

Overview

Understanding the pathogenesis and progression of chronic lung diseases is critical for therapeutic development. Different types of OMICs data, including genetic, genomic, transcriptomic and epigenetic data, provide rich, reproducible and mechanism-indicating information for understanding disease pathogenesis and progression. However, OMICs data are usually high-dimensional, with complicated data structures, high noise levels, and complex interactions between features (genes, proteins, metabolites, etc.). Analyzing such data is challenging but critical for obtaining biologically meaningful and reproducible discoveries.

My current research interests focus on two areas: (1) developing novel statistical and computational models to analyze large-scale omics and drug perturbation data to better understand disease pathogenesis and advance precision medicine, and (2) understanding the heterogeneity, pathogenesis and progression of pulmonary diseases, such as asthma, idiopathic pulmonary fibrosis (IPF), sarcoidosis and pediatric cystic fibrosis, by tailoring statistical and computational methods based on existing biological knowledge of the diseases.

My team has been involved in multiple transcriptomic studies of asthma, IPF, sarcoidosis, cystic fibrosis and lung injuries in pediatric patients undergoing cardiac bypass procedures. These studies generated various types of large-scale omics data, including microarray gene expression data, bulk RNA sequencing data, single-cell RNA sequencing data, T cell receptor repertoire data, 16S rRNA sequencing data, spatial transcriptomic data and single-cell chromatin structural data. For each study, we tailored our computational and statistical analysis of the data based on existing biological knowledge of the corresponding disease or condition. These analyses have led to discoveries about the heterogeneity of asthma pathogenesis, cell type-specific changes in asthma patients, the heterogeneity and molecular biomarkers of sarcoidosis, cell populations specific to IPF and COPD, and potential antigen-specific T cell clones for SARS-CoV-2 infection (COVID-19) in adults. My team is currently working closely with physicians and basic scientists to make further, more translational discoveries for these and other pulmonary diseases.

Through extensive analyses of the various types of omics data generated by our collaborators, my team also identifies computational and statistical needs and develops novel methods to address them. The computational tools we have developed address imputation of single-cell RNA sequencing data (G2S3), identification of differentially expressed genes from scRNA-seq data with multiple subjects (iDESC), and cell type deconvolution of spatial transcriptomic data (SDePER). The development of these tools has further increased our capacity to analyze different types of OMICs data to better understand disease heterogeneity, pathogenesis and progression.

Specifically, my research on developing novel computational and statistical tools for OMICs data analysis includes the following topics.

First, I designed an unsupervised clustering method to cluster patients based on pathway activity assessed using bulk gene expression data. Specifically, we cluster patients using the expression levels of the member genes in each pathway and summarize the clustering results across all pathways to define a distance score between patients. In this way, the clustering results are not dominated by signals from "big" pathways that involve thousands of genes, and prior knowledge of gene-gene interactions can be incorporated to make the results more biologically relevant. Applying the method to gene expression data from asthma patients identified three clinically meaningful groups of patients, which were validated in an independent cohort of pediatric asthma patients. Along this line, my team is currently developing deep learning models to identify disease heterogeneity from single-cell expression data and spatial transcriptomic data, with and without considering existing knowledge of biological pathways.
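The pathway-based distance described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: a tiny two-cluster k-means stands in for the per-pathway clustering method, the distance between two patients is assumed to be the fraction of pathways in which they fall into different clusters, and the names `two_means` and `pathway_distance` are hypothetical.

```python
import numpy as np

def two_means(X, n_iter=20):
    """Tiny deterministic 2-cluster k-means (a stand-in for the real clustering step)."""
    score = X.sum(axis=1)
    # Initialise the two centers at the samples with the lowest and highest total expression.
    centers = np.stack([X[score.argmin()], X[score.argmax()]]).astype(float)
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def pathway_distance(expr, pathways):
    """expr: patients x genes array; pathways: dict of name -> list of gene column indices.

    Assumed summary: distance = fraction of pathways that separate two patients.
    """
    n = expr.shape[0]
    dist = np.zeros((n, n))
    for genes in pathways.values():
        labels = two_means(expr[:, genes])
        # Each pathway adds at most 1 per pair, whatever its number of genes.
        dist += labels[:, None] != labels[None, :]
    return dist / len(pathways)

# Toy data (made up): 6 patients, 8 genes, patients 3-5 shifted upwards.
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 0.1, size=(6, 8))
expr[3:] += 5.0
pathways = {"p1": [0, 1, 2], "p2": [3, 4], "p3": [5, 6, 7]}
print(pathway_distance(expr, pathways))
```

Because every pathway contributes one vote per patient pair, a pathway with thousands of genes carries the same weight as a small one, which is the property the paragraph emphasizes. The resulting matrix can be passed to any distance-based clustering routine.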

Second, despite the great potential of scRNA-seq data to provide cell type-specific information on changes in disease, the data contain an excessive number of zeros, which reduces their information depth. Most data imputation methods borrow information from similar cells, an approach that has been shown to suffer from over-smoothing. To address this, my research team designed a data imputation method, G2S3, which imputes dropouts using similar genes. A gene-gene affinity network is learned from the scRNA-seq data, and imputation is conducted using a lazy random walk on that network.
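The idea of imputing along a gene-gene network with a lazy random walk can be sketched as follows. This is a simplified illustration, not the G2S3 algorithm itself: the affinity network here is a plain Gaussian kernel on gene-gene distances (an assumption made only for this example; the paragraph says the network is learned from the data), and the function names `gene_affinity`, `lazy_transition` and `impute` are hypothetical.

```python
import numpy as np

def gene_affinity(X, bandwidth=None):
    """X: cells x genes. Gaussian-kernel affinity between genes (illustrative choice)."""
    G = X.T  # genes x cells
    sq = ((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=2)
    if bandwidth is None:
        positive = sq[sq > 0]
        bandwidth = np.median(positive) if positive.size else 1.0
    W = np.exp(-sq / bandwidth)
    np.fill_diagonal(W, 0.0)  # no self-affinity; laziness is added in the walk
    return W

def lazy_transition(W):
    """Lazy random walk: stay put with probability 1/2, otherwise step to a neighbor gene."""
    n = W.shape[0]
    deg = W.sum(axis=1, keepdims=True)
    safe = np.where(deg > 0, deg, 1.0)
    T = np.where(deg > 0, W / safe, np.eye(n))  # isolated genes keep their own value
    return 0.5 * (np.eye(n) + T)

def impute(X, W, steps=1):
    """Replace each gene by a walk-weighted average of itself and similar genes."""
    P = np.linalg.matrix_power(lazy_transition(W), steps)
    return X @ P.T

# Toy count matrix (made up): 4 cells x 3 genes; cell 2 has a zero in gene 0
# although the similar gene 1 is expressed there.
X = np.array([[5.0, 5.0, 0.0],
              [4.0, 4.0, 0.0],
              [0.0, 6.0, 1.0],
              [5.0, 5.0, 0.0]])
X_imputed = impute(X, gene_affinity(X))
print(X_imputed.round(2))
```

The laziness term keeps half of each gene's own observed signal, so a single step smooths toward similar genes without discarding the original measurement.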

Third, my research team developed a method to identify differentially expressed genes between two groups of subjects from single cell RNA sequencing data. This question is fundamentally different from that of identifying differentially expressed genes between two groups of cells. Because the separation between cell populations has been shown to be the dominant source of variation in single cell RNA sequencing data, cell population markers can still be found easily even when the hierarchical data structure (some cells come from the same subject) is not considered. For certain cell types, however, cells from the same subject cluster together rather than with cells of the same type from different subjects. This indicates that, for these cell types, the subject effect is the second dominant source of variation and needs to be considered when identifying differentially expressed genes between two groups of individuals. We designed iDESC to account for this subject effect and the high dropout rate in scRNA-seq data. The results demonstrated a highly inflated type I error rate when the subject effect is not considered, and the superior performance of iDESC.
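The type I error inflation mentioned above can be illustrated with a small simulation. This sketch is not iDESC: it uses Gaussian expression values rather than counts with dropouts, and it contrasts a cell-level t-test with a simple subject-level (pseudobulk) t-test, which is swapped in here only to show why the subject effect matters. All parameter values and the function names are assumptions made for the example.

```python
import numpy as np

N_SUBJ = 5          # subjects per group (assumed)
T_CRIT_DF8 = 2.306  # two-sided 5% critical value of Student's t with 2*N_SUBJ - 2 = 8 df
Z_CRIT = 1.96       # large-sample critical value used for the cell-level test

def t_stat(a, b):
    """Two-sample t statistic; with equal group sizes pooled and Welch forms coincide."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

def false_positive_rates(n_sim=500, n_cells=200, subj_sd=0.5, seed=0):
    """Simulate one gene with NO group difference and count rejections at the 5% level."""
    rng = np.random.default_rng(seed)
    hits_cell = 0
    hits_subj = 0
    for _ in range(n_sim):
        # Each subject has its own random shift shared by all of its cells.
        effects = rng.normal(0.0, subj_sd, size=(2, N_SUBJ, 1))
        cells = effects + rng.normal(0.0, 1.0, size=(2, N_SUBJ, n_cells))
        # Cell-level test: treats every cell as an independent observation.
        if abs(t_stat(cells[0].ravel(), cells[1].ravel())) > Z_CRIT:
            hits_cell += 1
        # Subject-level test: one mean per subject, so subjects are the units.
        means = cells.mean(axis=2)
        if abs(t_stat(means[0], means[1])) > T_CRIT_DF8:
            hits_subj += 1
    return hits_cell / n_sim, hits_subj / n_sim

cell_rate, subj_rate = false_positive_rates()
print(f"cell-level false positive rate:    {cell_rate:.2f}")
print(f"subject-level false positive rate: {subj_rate:.2f}")
```

Under these assumptions the cell-level test rejects far more often than the nominal 5%, while the subject-level test stays close to it. Methods that model the subject effect directly aim to retain this error control without discarding cell-level information.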

Fourth, spatial-barcoding-based spatial transcriptomic technologies measure gene expression levels in an unbiased manner, but not at single-cell resolution. Each capture spot may contain an unknown number of cells of unknown types. A straightforward way to estimate the cell type composition of each capture spot is to deconvolute the data using existing scRNA-seq data from the same tissue type. However, scRNA-seq data and spatial transcriptomic data differ systematically from each other, making it incorrect to assume that the observed spatial transcriptomic data are a linear combination of the cell type-specific expression profiles from scRNA-seq data. In addition, compared with the total number of cell types in the scRNA-seq data, the number of cell types present in each capture spot is usually quite small. Lastly, the cell type compositions of neighboring capture spots are highly correlated because tissue structure is spatially continuous. To address all these challenges, we designed SDePER to estimate the cell type composition of each capture spot while accounting for platform effects, sparsity and spatial correlation. Based on the estimated cell type composition, SDePER also imputes the expression data for the spatial map at an enhanced resolution. SDePER is the first method designed to use machine learning to correct for platform effects; this correction was shown to be more efficient than batch correction methods and to contribute the most to the performance boost.
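For reference, the "straightforward" linear deconvolution that the paragraph describes as a starting point can be sketched as a non-negative least squares fit. This is only the naive baseline, not SDePER: it ignores platform effects, sparsity and spatial correlation. The solver is a basic projected gradient descent, and the signature matrix, the spot and the function name `deconvolve_spot` are made up for illustration.

```python
import numpy as np

def deconvolve_spot(signatures, spot, n_iter=500):
    """Naive deconvolution of one capture spot.

    signatures: genes x cell_types matrix of reference expression profiles.
    spot:       length-genes expression vector of the capture spot.
    Returns non-negative proportions summing to 1 (all zeros if nothing fits).
    """
    A = np.asarray(signatures, dtype=float)
    y = np.asarray(spot, dtype=float)
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)
        w = np.maximum(w - step * grad, 0.0)  # project onto the non-negative orthant
    total = w.sum()
    return w / total if total > 0 else w

# Toy reference (made up): 6 genes x 3 cell types with distinct marker genes.
signatures = np.array([[10.0, 1.0, 0.0],
                       [8.0, 0.0, 1.0],
                       [1.0, 9.0, 0.0],
                       [0.0, 7.0, 1.0],
                       [0.0, 1.0, 12.0],
                       [1.0, 0.0, 9.0]])
# A spot built as an exact mixture, which is the idealized case the linear model assumes.
spot = 0.7 * signatures[:, 0] + 0.3 * signatures[:, 2]
print(deconvolve_spot(signatures, spot).round(3))
```

In this idealized setting the mixture is recovered because the spot really is a linear combination of the reference profiles. The systematic differences between platforms described above break exactly that assumption, which is why an explicit correction step is needed in practice.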

Medical Research Interests

Biostatistics; Computational Biology; Genetics; Genomics; Lung Diseases; Molecular Biology; Molecular Medicine; Respiratory Hypersensitivity

Public Health Interests

Microbial Ecology; Modeling; Genetics, Genomics, Epigenetics; Biomarkers; Bioinformatics

ORCID
0000-0001-8688-9004

Research at a Glance

Yale Co-Authors

Frequent collaborators of Xiting Yan's published research.

Naftali Kaminski, MD (27 common publications)
Jose Gomez Villalobos, MD, MS (14 common publications)
Jonas Christian Schupp, MD (13 common publications)
Taylor Adams (11 common publications)
Geoffrey Chupp, MD (9 common publications)
Farida Ahangari, MD (7 common publications)

Publications

Featured Publications

Academic Achievements & Community Involvement

The Journal of Allergy and Clinical Immunology
Journal Service: Reviewer

BMC Bioinformatics
Journal Service: Reviewer

Bioinformatics
Journal Service: Reviewer

Computational Methods for Single-cell RNA Sequencing and Spatial Transcriptomic Data Analysis for Precision Medicine
Oral Presentation: Departmental Biostatistics Seminar

Understanding disease heterogeneity of asthma using a pathway-based distance score for gene expression data
Oral Presentation: Biostatistics Seminar

Teaching & Mentoring

Mentoring

Siming Zheng
Postdoc
2024 - Present
Huanhuan Wei
Postdoc
2023 - 2026
Yuening Zhang
Postdoc
2023 - Present

Willing and Available to Mentor
- Students
- Postdoctoral Researchers
- Clinical Fellows

News & Links

Media

B). Heatmap showing the clustering results by KEGG pathways using MCLUST. The color represents the clustering assignment of each sample by the KEGG pathways. C). Pathway based distance matrix among the clusters. The color of entry represents the pathway based distance between the corresponding two samples. Red represents a small distance (samples are strongly related) and white represents longer distance showing the strength of the clusters (samples are weakly related). Samples within TEA cluster 3 are the most strongly related and most homogeneous, followed by cluster 1 and 2, respectively.

Get In Touch

Contacts

xiting.yan@yale.edu

Academic Office Number

203.785.5567

News

Unique Immune Profile Identified in Fibrotic Hypersensitivity Pneumonitis

Department of Internal Medicine Promotions and Reappointments

Scientists Apply High-resolution, Single-cell Profiling to Understand Immune Response in Severe COVID-19

Despite the challenges of COVID-19, Yale-PCCSM section members continued their work on scientific papers
