Xiting Yan, PhD
Associate Professor of Medicine (Pulmonary, Critical Care and Sleep Medicine)
About
Research
Overview
Understanding the pathogenesis and progression of chronic lung diseases is critical for therapeutic development. Different types of omics data, including genetic, genomic, transcriptomic and epigenetic data, provide rich, reproducible and mechanistically informative evidence for understanding disease pathogenesis and progression. However, omics data are usually high-dimensional, with complicated data structures, high noise levels and complex interactions between features (genes, proteins, metabolites, etc.). Analyzing such data is challenging but critical for obtaining biologically meaningful and reproducible discoveries.
My current research interests focus on two areas: (1) developing novel statistical and computational models to analyze large-scale omics and drug perturbation data to better understand disease pathogenesis and advance precision medicine, and (2) understanding the heterogeneity, pathogenesis and progression of pulmonary diseases, such as asthma, idiopathic pulmonary fibrosis (IPF), sarcoidosis and pediatric cystic fibrosis, by tailoring statistical and computational methods to existing biological knowledge of these diseases.
My team has been involved in multiple transcriptomic studies of asthma, IPF, sarcoidosis, cystic fibrosis and lung injuries in pediatric patients undergoing cardiopulmonary bypass procedures. These studies generated various types of large-scale transcriptomic data, including microarray gene expression data, bulk RNA sequencing data, single-cell RNA sequencing data, T cell receptor repertoire data, 16S rRNA sequencing data, spatial transcriptomic data and single-cell chromatin structural data. For each study, we tailored our computational and statistical analysis to existing biological knowledge of the corresponding disease or condition. These analyses have led to discoveries on the heterogeneity of asthma pathogenesis, cell type-specific changes in asthma patients, the heterogeneity and molecular biomarkers of sarcoidosis, cell populations specific to IPF and COPD, and potential antigen-specific T cell clones for SARS-CoV-2 infection (COVID-19) in adults, among others. My team is currently working closely with physicians and basic scientists to make further, more translational discoveries for these and other pulmonary diseases.
Through extensive analyses of the various types of omics data generated by our collaborators, my team also identifies unmet computational and statistical needs and develops novel methods to address them. The computational tools we have developed include a method for imputing single-cell RNA sequencing data (G2S3), a method for identifying differentially expressed genes from scRNA-seq data with multiple subjects (iDESC), and a method for cell type deconvolution of spatial transcriptomic data (SDePER). The development of these tools has further increased our capacity to analyze different types of omics data to better understand disease heterogeneity, pathogenesis and progression.
Specifically, my research on developing novel computational and statistical tools for omics data analysis includes the following topics.
First, I designed an unsupervised clustering method to cluster patients based on pathway activity assessed using bulk gene expression data. Specifically, we cluster patients using the expression levels of the member genes in each pathway and summarize the clustering results across all pathways to define a distance score between patients. In this way, the clustering results are not dominated by signals from "big" pathways that involve thousands of genes, and prior knowledge of gene-gene interactions can be incorporated to improve the biological relevance of the results. Application of the method to gene expression data from asthma patients identified three clinically meaningful groups of patients, which were validated in an independent cohort of pediatric asthma patients. Along this line, my team is currently developing deep learning models to identify disease heterogeneity from single-cell expression data and spatial transcriptomic data, with and without considering existing knowledge of biological pathways.
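As a rough illustration of the pathway-based distance idea described above, the following minimal Python sketch clusters patients separately within each pathway and defines the distance between two patients as the fraction of pathways in which they fall into different clusters. It is a sketch under stated assumptions, not the published method: the simple 2-means clusterer, the function names `two_means` and `pathway_distance`, and the toy expression values are all hypothetical stand-ins.

```python
import math

def two_means(points, iters=20):
    """Tiny deterministic 2-means; a stand-in for the per-pathway clusterer."""
    first = points[0]
    far = max(points, key=lambda p: math.dist(p, first))  # farthest-point init
    centers = [list(first), list(far)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def pathway_distance(expr, pathways):
    """expr: patient -> {gene: value}; pathways: name -> list of member genes.

    Returns the sorted patient names and a matrix whose (i, j) entry is the
    fraction of pathways in which patients i and j are clustered apart.
    """
    patients = sorted(expr)
    n = len(patients)
    apart = [[0] * n for _ in range(n)]
    for genes in pathways.values():
        points = [[expr[p][g] for g in genes] for p in patients]
        labels = two_means(points)
        for i in range(n):
            for j in range(n):
                if labels[i] != labels[j]:
                    apart[i][j] += 1
    m = len(pathways)
    return patients, [[count / m for count in row] for row in apart]

# Hypothetical toy data: 4 patients, 5 genes, 3 pathways of different sizes.
expr = {
    "P1": {"g1": 1.0, "g2": 1.2, "g3": 5.0, "g4": 5.1, "g5": 0.1},
    "P2": {"g1": 1.1, "g2": 0.9, "g3": 5.2, "g4": 4.9, "g5": 3.0},
    "P3": {"g1": 4.0, "g2": 4.2, "g3": 1.0, "g4": 1.1, "g5": 0.2},
    "P4": {"g1": 4.1, "g2": 3.9, "g3": 0.9, "g4": 1.2, "g5": 3.1},
}
pathways = {"pwA": ["g1", "g2"], "pwB": ["g3", "g4"], "pwC": ["g5"]}

patients, dist = pathway_distance(expr, pathways)
for name, row in zip(patients, dist):
    print(name, [round(d, 2) for d in row])
```

Because every pathway contributes exactly one vote to the distance regardless of how many genes it contains, a pathway with thousands of genes cannot outweigh a small one in this construction.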
Second, despite the great potential of scRNA-seq data to provide cell type-specific information on disease-related changes, the data contain an excessive number of zeros (dropouts), which reduces their information content. Most data imputation methods borrow information from similar cells, which has been shown to cause over-smoothing. To address this, my research team designed a data imputation method, G2S3, that imputes dropouts using similar genes instead. A gene-gene affinity network is learned from the scRNA-seq data, and imputation is then performed by a lazy random walk on this affinity network.
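The lazy random walk step described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the G2S3 implementation: the affinity matrix is supplied by hand instead of being learned from data, the function name `lazy_random_walk_impute` is hypothetical, and the exact normalisation used by G2S3 may differ. Each gene keeps half of its own value and takes the other half from its neighbours, weighted by row-normalised affinity.

```python
def lazy_random_walk_impute(cells_by_genes, affinity, steps=1):
    """Smooth each cell's expression over a gene-gene affinity network.

    cells_by_genes: list of cells, each a list of gene expression values.
    affinity: square, non-negative gene-gene affinity matrix (assumed given).
    """
    n_genes = len(affinity)
    walk = []
    for i, row in enumerate(affinity):
        total = sum(row)
        if total > 0:
            # Lazy walk: stay on gene i with probability 1/2, otherwise move
            # to a neighbour with probability proportional to its affinity.
            walk.append([0.5 * w / total + (0.5 if i == j else 0.0)
                         for j, w in enumerate(row)])
        else:
            # A gene with no neighbours keeps its observed value.
            walk.append([1.0 if i == j else 0.0 for j in range(n_genes)])
    out = [list(cell) for cell in cells_by_genes]
    for _ in range(steps):
        out = [[sum(walk[g][h] * cell[h] for h in range(n_genes))
                for g in range(n_genes)]
               for cell in out]
    return out

# Hypothetical toy network: genes 0 and 1 are linked, gene 2 is isolated.
affinity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
cells = [[0.0, 4.0, 3.0]]  # gene 0 looks like a dropout in this cell

imputed = lazy_random_walk_impute(cells, affinity)
print(imputed)  # [[2.0, 2.0, 3.0]]
```

In this toy case the zero in gene 0 is filled from its linked gene 1, while the isolated gene 2 is left unchanged. Borrowing across genes rather than across cells is what distinguishes this style of imputation from cell-similarity approaches.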
Third, my research team developed a method to identify differentially expressed genes between two groups of subjects from single-cell RNA sequencing data. This question is fundamentally different from identifying differentially expressed genes between two groups of cells. The separation between cell populations has been shown to be the dominant source of variation in single-cell RNA sequencing data, so cell population markers can still be found easily even when the hierarchical data structure (that is, some cells come from the same subject) is ignored. For certain cell types, however, cells from the same subject cluster together rather than with cells of the same type from different subjects. This indicates that, for these cell types, the subject effect is the second most dominant source of variation and needs to be considered when identifying differentially expressed genes between two groups of individuals. We designed iDESC to account for both this subject effect and the high dropout rate in scRNA-seq data. The results demonstrated a highly inflated type I error rate when the subject effect is not considered, as well as the superior performance of iDESC.
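The type I error inflation mentioned above can be reproduced in a small null simulation. The sketch below is not iDESC (it does not model dropouts or fit a mixed model); it only contrasts a cell-level test that ignores the subject effect with a simple subject-level test on per-subject means. All settings (5 subjects per group, 50 cells per subject, unit variances, Gaussian expression) and the names `welch_t` and `simulate_null` are assumptions made for illustration.

```python
import random
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

def simulate_null(n_sims=200, subjects_per_group=5, cells_per_subject=50,
                  subject_sd=1.0, cell_sd=1.0, seed=0):
    """Simulate one gene with NO true group difference and count rejections."""
    rng = random.Random(seed)
    cell_hits = subject_hits = 0
    for _ in range(n_sims):
        group_cells, group_means = [], []
        for _group in range(2):
            cells, means = [], []
            for _subject in range(subjects_per_group):
                shift = rng.gauss(0.0, subject_sd)  # shared by all cells of a subject
                values = [shift + rng.gauss(0.0, cell_sd)
                          for _ in range(cells_per_subject)]
                cells.extend(values)
                means.append(statistics.mean(values))
            group_cells.append(cells)
            group_means.append(means)
        # Cell-level test treats all cells as independent (normal cutoff, 5% level).
        if abs(welch_t(group_cells[0], group_cells[1])) > 1.96:
            cell_hits += 1
        # Subject-level test on per-subject means (approximate t cutoff, 8 df).
        if abs(welch_t(group_means[0], group_means[1])) > 2.306:
            subject_hits += 1
    return cell_hits / n_sims, subject_hits / n_sims

cell_rate, subject_rate = simulate_null()
print(f"false positive rate, cells treated as independent: {cell_rate:.2f}")
print(f"false positive rate, one value per subject:        {subject_rate:.2f}")
```

Under these assumed settings the cell-level rate should come out far above the nominal 5% level, while the subject-level rate should stay close to it, because cells from the same subject share a common shift and are therefore not independent observations.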
Fourth, spatial barcoding-based spatial transcriptomic technologies measure gene expression levels in an unbiased manner but not at single-cell resolution. Each capture spot may contain an unknown number of cells of unknown types. A straightforward way to estimate the cell type composition of each capture spot is to deconvolve the data using existing scRNA-seq data from the same tissue type. However, scRNA-seq data and spatial transcriptomic data differ systematically from each other, so it is incorrect to assume that the observed spatial transcriptomic data are a linear combination of the cell type-specific expression profiles from scRNA-seq data. In addition, compared with the total number of cell types in the scRNA-seq data, the number of cell types present in each capture spot is usually quite small. Lastly, the cell type compositions of neighboring capture spots are highly correlated because of the spatial continuity of tissue. To address all these challenges, we designed SDePER to estimate the cell type composition of each capture spot while accounting for platform effects, sparsity and spatial correlation. Based on the estimated cell type composition, SDePER also imputes the expression data for the spatial map at an enhanced resolution. SDePER is the first method designed to use machine learning to correct for platform effects; this correction was shown to be more efficient than batch correction methods and to contribute the most to the performance gain.
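To make the platform-effect problem concrete, the sketch below implements only the naive linear deconvolution baseline that the paragraph above argues against, not SDePER itself. It estimates non-negative cell type weights for one spot by projected gradient descent and then shows how a gene-specific distortion between platforms biases the estimate. The signature values, the distortion factors and the function name `nnls_proportions` are hypothetical.

```python
def nnls_proportions(spot, signatures, iters=500):
    """Minimise ||S w - y||^2 subject to w >= 0, then rescale w to sum to 1.

    spot: observed expression of one capture spot (one value per gene).
    signatures: one reference expression profile per cell type.
    """
    n_types, n_genes = len(signatures), len(spot)
    # 1 / ||S||_F^2 is a safe step size: it never exceeds 1 / lambda_max(S^T S).
    step = 1.0 / sum(v * v for sig in signatures for v in sig)
    w = [1.0 / n_types] * n_types
    for _ in range(iters):
        fitted = [sum(w[k] * signatures[k][g] for k in range(n_types))
                  for g in range(n_genes)]
        resid = [fitted[g] - spot[g] for g in range(n_genes)]
        grad = [sum(signatures[k][g] * resid[g] for g in range(n_genes))
                for k in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[k] - step * grad[k]) for k in range(n_types)]
    total = sum(w)
    return [x / total for x in w] if total > 0 else w

# Hypothetical scRNA-seq signatures for three cell types over six genes.
signatures = [
    [10.0, 9.0, 1.0, 1.0, 1.0, 1.0],   # cell type A
    [1.0, 1.0, 10.0, 9.0, 1.0, 1.0],   # cell type B
    [1.0, 1.0, 1.0, 1.0, 10.0, 9.0],   # cell type C
]
truth = [0.7, 0.3, 0.0]  # only two of the three types are present in the spot

# Spot generated exactly as a linear mixture (no platform effect).
clean_spot = [sum(truth[k] * signatures[k][g] for k in range(3)) for g in range(6)]
# Same spot after an assumed gene-specific platform distortion.
platform = [2.0, 2.0, 0.5, 0.5, 1.0, 1.0]
shifted_spot = [f * y for f, y in zip(platform, clean_spot)]

clean_est = nnls_proportions(clean_spot, signatures)
shifted_est = nnls_proportions(shifted_spot, signatures)
print("no platform effect:  ", [round(x, 3) for x in clean_est])
print("with platform effect:", [round(x, 3) for x in shifted_est])
```

When the spot really is a linear mixture of the reference profiles, the baseline recovers the true proportions. Once genes are measured with different efficiencies on the two platforms, the same procedure overestimates cell type A, which is why a correction for platform effects is needed before or during deconvolution.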
Medical Research Interests
Public Health Interests
Academic Achievements & Community Involvement
Teaching & Mentoring
Mentoring
- Siming Zheng, Postdoc, 2024 - Present
- Huanhuan Wei, Postdoc, 2023 - 2026
- Yuening Zhang, Postdoc, 2023 - Present
News & Links
Media
- B). Heatmap showing the clustering results by KEGG pathways using MCLUST. The color represents the cluster assignment of each sample for each KEGG pathway. C). Pathway-based distance matrix among the clusters. The color of each entry represents the pathway-based distance between the corresponding two samples and indicates the strength of the clusters: red represents a small distance (samples are strongly related) and white represents a large distance (samples are weakly related). Samples within TEA cluster 3 are the most strongly related and most homogeneous, followed by clusters 1 and 2, respectively.
News
- August 21, 2024
Unique Immune Profile Identified in Fibrotic Hypersensitivity Pneumonitis
- March 21, 2023
Department of Internal Medicine Promotions and Reappointments
- February 08, 2022
Scientists Apply High-resolution, Single-cell Profiling to Understand Immune Response in Severe COVID-19
- June 23, 2021
Despite the challenges of COVID-19, Yale-PCCSM section members continued their work on scientific papers