Xiting Yan, PhD

Associate Professor of Medicine (Pulmonary, Critical Care and Sleep Medicine)

Additional Titles

Director of Data Analysis and Bioinformatics Hub, The Center for Precision Pulmonary Medicine (P2MED)

Assistant Professor, Biostatistics

About

Biography

Dr. Yan received doctoral degrees in both applied statistics and computational biology and bioinformatics. She is interested in genetics, genomics, computational biology, biostatistics, systems biology and bioinformatics. Her current research topics include (1) understanding disease heterogeneity and pathogenesis using large-scale omics data at both bulk and single-cell resolution and (2) developing novel statistical and computational methods for analyzing different types of omics data and integrating them with drug perturbation data for potential personalized treatment design.

Education & Training

Postdoctoral Associate
Yale School of Medicine (2010)
PhD
University of Southern California, Biological Science Department/Computational Biology and Bioinformatics (2009)
PhD
Peking University, Department of Probability and Statistics, School of Mathematical Sciences/Applied Statistics (2006)
BS
Peking University, Department of Probability and Statistics, School of Mathematical Sciences/Probability and Statistics (2001)

Research

Overview

Understanding the pathogenesis and progression of chronic lung diseases is critical for therapeutic development. Different types of omics data, including genetic, genomic, transcriptomic and epigenetic data, provide rich, reproducible and mechanistically informative evidence for understanding disease pathogenesis and progression. However, omics data are usually high-dimensional and noisy, with complicated data structures and complex interactions between features (genes, proteins, metabolites, etc.). The corresponding data analysis is therefore challenging, but it is critical for obtaining biologically meaningful and reproducible discoveries.

My current research focuses on two areas: (1) developing novel statistical and computational models to analyze large-scale omics and drug perturbation data to better understand disease pathogenesis and advance precision medicine, and (2) understanding the heterogeneity, pathogenesis and progression of pulmonary diseases, such as asthma, idiopathic pulmonary fibrosis (IPF), sarcoidosis and pediatric cystic fibrosis, by tailoring statistical and computational methods to existing biological knowledge of the diseases.

My team has been involved in multiple transcriptomic studies of asthma, IPF, sarcoidosis, cystic fibrosis and lung injury in pediatric patients undergoing cardiopulmonary bypass procedures. These studies generated various types of large-scale omics data, including microarray gene expression data, bulk RNA sequencing data, single cell RNA sequencing (scRNA-seq) data, T cell receptor repertoire data, 16S rRNA sequencing data, spatial transcriptomic data and single-cell chromatin structural data. For each study, we tailored our computational and statistical analysis to existing biological knowledge of the corresponding disease or condition. These analyses have led to discoveries on the heterogeneity of asthma pathogenesis, cell type-specific changes in asthma patients, the heterogeneity and molecular biomarkers of sarcoidosis, cell populations specific to IPF and COPD, and potential antigen-specific T cell clones for SARS-CoV-2 infection (COVID-19) in adults. My team is currently working closely with physicians and basic scientists to make further, more translational discoveries for these and other pulmonary diseases.

Through extensive analyses of various types of omics data generated by our collaborators, my team also identifies unmet computational and statistical needs and develops novel methods to address them. The computational tools we have developed include methods for imputation of single-cell RNA sequencing data (G2S3), identification of differentially expressed genes from scRNA-seq data with multiple subjects (iDESC), and cell type deconvolution of spatial transcriptomic data (SDePER). The development of these tools has further strengthened our capacity to analyze different types of omics data to better understand disease heterogeneity, pathogenesis and progression.

Specifically, my research on developing novel computational and statistical tools for omics data analysis covers the following topics.

First, I designed an unsupervised clustering method that groups patients based on pathway activity assessed from bulk gene expression data. Specifically, we cluster patients using the expression levels of the member genes of each pathway and summarize the clustering results across all pathways to define a distance score between patients. In this way, the clustering results are not dominated by signals from "big" pathways that involve thousands of genes, and prior knowledge of gene-gene interactions can be incorporated to improve the biological relevance of the results. Application of the method to gene expression data from asthma patients identified three clinically meaningful groups of patients, which were validated in an independent cohort of pediatric asthma patients. Along this line, my team is currently developing deep learning models to identify disease heterogeneity from single cell expression data and spatial transcriptomic data, with and without incorporating existing knowledge of biological pathways.
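
The idea of a pathway-based distance score can be illustrated with a minimal sketch. It is not the published implementation. The per-pathway clustering (a plain 2-means), the way results are summarized (the fraction of pathways in which two patients land in different clusters), and all names (`two_means_labels`, `pathway_distance`, the toy pathways) are assumptions chosen for illustration.

```python
import numpy as np

def two_means_labels(x, n_iter=20):
    """Split patients into two clusters using one pathway's member genes.

    x: (n_patients, n_pathway_genes) expression matrix.
    Plain 2-means is an assumption made for this sketch.
    """
    x = np.asarray(x, dtype=float)
    # Deterministic start: patients with the lowest and highest mean expression.
    score = x.mean(axis=1)
    centers = np.stack([x[score.argmin()], x[score.argmax()]])
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = x[labels == k].mean(axis=0)
    return labels

def pathway_distance(expr, pathways):
    """Fraction of pathways in which two patients fall into different clusters."""
    n_patients = expr.shape[0]
    dist = np.zeros((n_patients, n_patients))
    for gene_idx in pathways.values():
        labels = two_means_labels(expr[:, gene_idx])
        dist += labels[:, None] != labels[None, :]
    return dist / len(pathways)

# Toy data (hypothetical): 6 patients in two groups, 39 genes.
rng = np.random.default_rng(0)
group = np.array([0, 0, 0, 1, 1, 1])
expr = rng.normal(0.0, 0.1, size=(6, 39))
expr[:, :9] += 3.0 * group[:, None]  # only genes 0-8 carry the group signal
pathways = {
    "small_a": [0, 1, 2],
    "small_b": [3, 4, 5],
    "small_c": [6, 7, 8],
    "big_noise": list(range(9, 39)),  # large pathway with no signal
}
dist = pathway_distance(expr, pathways)
print(np.round(dist, 2))
```

Because every pathway contributes exactly one vote to the distance, the 30-gene noise pathway in this toy example cannot outweigh the three small informative pathways, which is the property described above.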

Second, despite the great potential of scRNA-seq data to provide cell type-specific information on changes in disease, the data contain an excessive number of zeros (dropouts), which reduces their information content. Most data imputation methods borrow information from similar cells, which has been shown to cause over-smoothing. To address this, my research team designed a data imputation method, G2S3, that imputes dropouts using similar genes. A gene-gene affinity network is learned from the scRNA-seq data, and imputation is carried out by a lazy random walk on this network.
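
A rough sketch of gene-based imputation with a lazy random walk is given below. It is not G2S3 itself: the text does not specify how the affinity network is learned, so the sketch swaps in a simple stand-in (non-negative Pearson correlation between genes). The function name `lazy_walk_impute` and the toy matrix are hypothetical.

```python
import numpy as np

def lazy_walk_impute(counts, n_steps=1):
    """Smooth each gene toward its neighbor genes with a lazy random walk.

    counts: (n_cells, n_genes) expression matrix.
    The affinity (non-negative correlation) is a stand-in assumption,
    not the network learned by G2S3.
    """
    x = np.asarray(counts, dtype=float)
    n_genes = x.shape[1]
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(x, rowvar=False)      # genes are columns
    affinity = np.clip(np.nan_to_num(corr), 0.0, None)
    np.fill_diagonal(affinity, 0.0)

    # Row-normalize the affinity into a transition matrix between genes.
    degree = affinity.sum(axis=1)
    transition = affinity / np.where(degree > 0, degree, 1.0)[:, None]
    isolated = np.flatnonzero(degree == 0)
    transition[isolated, isolated] = 1.0         # isolated genes keep their values

    # Lazy walk: stay put with probability 1/2, otherwise move to a neighbor.
    lazy = 0.5 * (np.eye(n_genes) + transition)
    walk = np.linalg.matrix_power(lazy, n_steps)
    # Imputed gene j is a weighted average of genes k with weights walk[j, k].
    return x @ walk.T

# Toy example (hypothetical): gene 1 tracks gene 0 but has a dropout in cell 2.
toy = np.array([
    [1.0,  2.0, 5.0],
    [2.0,  4.0, 1.0],
    [3.0,  0.0, 4.0],   # dropout in gene 1
    [4.0,  8.0, 2.0],
    [5.0, 10.0, 3.0],
])
imputed = lazy_walk_impute(toy, n_steps=1)
print(np.round(imputed, 2))
```

In this toy matrix the zero in gene 1 is pulled toward the value of its correlated neighbor (gene 0) in the same cell, while gene 2, which has no positively correlated neighbor, is left unchanged. Information is borrowed across genes within a cell rather than across cells.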

Third, my research team developed a method to identify differentially expressed genes between two groups of subjects from single cell RNA sequencing data. This question is fundamentally different from identifying differentially expressed genes between two groups of cells. The separation between cell populations has been shown to be the dominant source of variation in single cell RNA sequencing data, so cell population markers can still be found easily even when the hierarchical data structure (some cells come from the same subject) is ignored. For certain cell types, however, cells from the same subject cluster together rather than with cells of the same type from other subjects. This indicates that, for these cell types, the subject effect is the second dominant source of variation and needs to be considered when identifying differentially expressed genes between two groups of individuals. We designed iDESC to account for this subject effect and the high dropout rate in scRNA-seq data. The results demonstrated a highly inflated type I error rate when the subject effect is not considered, and superior performance of iDESC.
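
The inflation of type I error when the subject effect is ignored can be reproduced in a small simulation. The sketch below is not iDESC: it uses Gaussian toy data with no dropouts, and the subject-aware analysis is a simple stand-in (a t-test on per-subject means). The sample sizes, variances and names are assumptions for illustration.

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t statistic (groups here always have equal size)."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(0)
n_genes, n_subj, n_cells = 200, 6, 50   # subjects per group, cells per subject
T_CRIT_CELLS = 1.96    # approximate two-sided 5% cutoff with hundreds of cells
T_CRIT_SUBJ = 2.228    # two-sided 5% cutoff for a t distribution with 10 df

naive_hits = 0
subject_hits = 0
for _ in range(n_genes):
    # Null gene: no true group difference, only subject and cell-level noise.
    subject_effect = rng.normal(0.0, 1.0, size=(2, n_subj, 1))
    cells = subject_effect + rng.normal(0.0, 1.0, size=(2, n_subj, n_cells))

    # Naive analysis: pool all cells and ignore which subject they came from.
    t_cells = t_statistic(cells[0].ravel(), cells[1].ravel())
    naive_hits += int(abs(t_cells) > T_CRIT_CELLS)

    # Subject-aware stand-in: compare per-subject means between groups.
    subject_means = cells.mean(axis=2)
    t_subj = t_statistic(subject_means[0], subject_means[1])
    subject_hits += int(abs(t_subj) > T_CRIT_SUBJ)

naive_rate = naive_hits / n_genes
subject_rate = subject_hits / n_genes
print(f"false positive rate, cells pooled:  {naive_rate:.2f}")
print(f"false positive rate, subject level: {subject_rate:.2f}")
```

Every simulated gene is null, so both rates should be close to the nominal 5% if the test is valid. Pooling cells treats the 300 cells per group as independent observations even though they come from only 6 subjects, which is why its false positive rate is far above nominal.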

Fourth, spatial barcoding-based spatial transcriptomic technologies measure gene expression levels in an unbiased manner but not at single-cell resolution. Each capture spot may contain an unknown number of cells of unknown types. A straightforward way to estimate the cell type composition of each capture spot is to deconvolute the data using existing scRNA-seq data from the same tissue type. However, scRNA-seq data and spatial transcriptomic data differ systematically from each other (platform effects), so it is incorrect to assume that the observed spatial transcriptomic data are a linear combination of the cell type-specific expression profiles from scRNA-seq data. In addition, compared with the total number of cell types in the scRNA-seq data, the number of cell types present in each capture spot is usually quite small. Lastly, the cell type compositions of neighboring capture spots are highly correlated because of the spatial continuity of tissue. To address all these challenges, we designed SDePER to estimate the cell type composition of each capture spot while accounting for platform effects, sparsity and spatial correlation. Based on the estimated cell type composition, SDePER also imputes the expression data for the spatial map at an enhanced resolution. SDePER is the first method designed to use machine learning to correct for platform effects; this correction was shown to be more efficient than batch correction methods and to contribute the most to the performance boost.
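
For reference, the "straightforward" linear deconvolution described above can be sketched as a non-negative least squares fit of one capture spot against cell type signatures. This is only the naive baseline, not SDePER: it deliberately ignores platform effects, sparsity and spatial correlation. The solver (projected gradient descent), the function name `nnls_proportions` and the toy signature matrix are assumptions for illustration.

```python
import numpy as np

def nnls_proportions(signatures, spot, n_iter=2000):
    """Estimate cell type proportions for one spot by non-negative least squares.

    signatures: (n_genes, n_cell_types) mean expression per cell type,
                e.g. derived from scRNA-seq reference data.
    spot:       (n_genes,) observed expression of one capture spot.
    Solved with projected gradient descent; a naive linear baseline only.
    """
    s = np.asarray(signatures, dtype=float)
    y = np.asarray(spot, dtype=float)
    # Step size 1/L, where L is the largest eigenvalue of S^T S.
    step = 1.0 / np.linalg.norm(s, 2) ** 2
    w = np.zeros(s.shape[1])
    for _ in range(n_iter):
        grad = s.T @ (s @ w - y)
        w = np.maximum(w - step * grad, 0.0)   # project onto w >= 0
    total = w.sum()
    return w / total if total > 0 else w       # normalize to proportions

# Toy example (hypothetical): 4 genes, 3 cell types, noiseless mixture.
signatures = np.array([
    [10.0, 1.0, 0.0],
    [ 1.0, 8.0, 1.0],
    [ 0.0, 1.0, 9.0],
    [ 2.0, 2.0, 2.0],
])
true_props = np.array([0.7, 0.3, 0.0])
spot = 5.0 * signatures @ true_props           # 5.0 mimics the number of cells
estimated = nnls_proportions(signatures, spot)
print(np.round(estimated, 3))
```

The toy spot is an exact linear mixture, so the baseline recovers the proportions. With real data the systematic differences between platforms break this assumption, which is the motivation given above for modeling platform effects explicitly.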

Medical Research Interests

Biostatistics; Computational Biology; Genetics; Genomics; Lung Diseases; Molecular Biology; Molecular Medicine; Respiratory Hypersensitivity

Public Health Interests

Microbial Ecology; Modeling; Bioinformatics; Biomarkers; Genetics, Genomics, Epigenetics

Academic Achievements & Community Involvement

  • The Journal of Allergy and Clinical Immunology
  • BMC Bioinformatics
  • Bioinformatics
  • Computational Methods for Single-cell RNA Sequencing and Spatial Transcriptomic Data Analysis for Precision Medicine
  • Understanding disease heterogeneity of asthma using a pathway-based distance score for gene expression data

Teaching & Mentoring

Mentoring

  • Siming Zheng

    Postdoc
    2024 - Present
  • Huanhuan Wei

    Postdoc
    2023 - 2026
  • Yuening Zhang

    Postdoc
    2023 - Present
