Research & Publications
Our research takes place at the intersection of human genetics, genome biology, bioinformatics and data science. We are building new computational methods and data resources to better analyze human genomes, and using high-throughput DNA sequencing technologies to study genome variation and disease. We are interested in basic aspects of genome biology including the mutational causes and molecular consequences of genetic variation. We are conducting large-scale genome sequencing studies of human disease, with a current focus on gene discovery for coronary artery disease and related traits in multi-ethnic cohorts. We are especially interested in developing better tools to analyze genome structural variation and other difficult-to-detect variant types, and are directly measuring these variants in our disease studies. Common threads in our work include the application of advanced computational methods, systematic analysis of large, unbiased datasets, and a strong effort to include diverse human populations, for both gene discovery and methods development.
Extensive Research Description
Human genome structural variation. We are interested in basic questions related to the mutational causes and proximal molecular consequences of genome variation in populations, individuals and cells. We are especially interested in structural variation (SV), which includes large (≥50 bp) copy number variants (CNVs), mobile element insertions (MEIs) and genomic rearrangements. Although SVs are few in number compared to smaller-scale variants – each human carries ~10,000 SVs compared to ~3 million SNPs – they have more severe consequences on average due to their ability to alter gene dosage, disrupt gene function, or rearrange genes and regulatory elements. However, SVs have not been assessed in most disease studies due to technical challenges. We are pursuing three lines of research in this area. First, we are developing open-source bioinformatics tools for SV detection, genotyping, annotation, and impact prediction, to enable comprehensive genome analysis at the scale of human populations. Second, we are analyzing large-scale WGS datasets to characterize the landscape of SV in human populations. Knowledge of rare SV is limited relative to other variant classes, and this confounds SV interpretation in genetic studies and clinical efforts. We recently characterized SV in tens of thousands of human genomes, which produced a valuable resource for the community and revealed the contribution of deleterious SV to the rare variant burden, and we are now pursuing related work in larger and more diverse datasets. Finally, we are measuring the contribution of SV to human phenotypic variation, a hotly debated question with practical importance for the design of genetic association studies. We are directly assessing SVs in large-scale studies of cardiovascular disease (see below) and other common diseases, and we are studying the impact of genome variation on gene expression and other molecular traits across tissues and cells, as in our prior work from the GTEx project.
Genetics of coronary artery disease (CAD) and cardiometabolic traits. A major goal of our current work is to identify new variants and genes that contribute to cardiovascular disease. This is a relatively new area of research in my lab that we launched in 2016 as part of our NHGRI Center for Common Disease Genomics (CCDG) program, and is a close collaboration with Dr. Nathan Stitziel, a cardiologist at WashU. The main project is a case/control association study of early-onset CAD in ~40,000 individuals, where deep cardiometabolic trait measurements are available for many samples. Although there has been much prior work in this area using traditional GWAS, our approach has several advantages. First, we aim to comprehensively assess all forms of genome variation. Whereas standard GWAS interrogates a subset of variation using SNP arrays, our use of deep WGS allows us to study all variant classes, genome-wide, across the full allele frequency spectrum. Second, our multi-ethnic study focuses on unique and understudied populations that promise to carry novel risk alleles, including African Americans, Latinos and Finnish Europeans, each of which is informative for different reasons. African Americans exhibit high levels of genetic diversity and carry many variants that are absent in Europeans, and have not been included in most prior GWAS efforts. Admixed Latino populations carry a variable mixture of European, Native American and African haplotypes, and are also understudied. Finnish Europeans are the product of a unique population history that includes multiple ancient bottlenecks and a recent expansion, leading to an excess of deleterious low-frequency variants in Finns that provides advantages for trait mapping. Finally, given that association studies alone are often not sufficient to identify causal variants, genes and mechanisms, we are leveraging single-cell and multiomics data from relevant tissues and populations to help interpret the above studies.
Human Pangenome Project. The human reference genome is inarguably the most important and widely used resource in the human genetics and genomics field. Yet, there is a growing realization that the current reference is inadequate to support the next generation of studies because it is a linear, haploid representation of haplotypes derived from multiple individuals, primarily of European descent, and thus does not adequately represent genetic diversity in the human population. This causes ancestry bias in key genomic applications, which can propagate to clinical assays and contribute to health disparities. We are co-leading a multi-site NHGRI-funded collaboration launched in 2019 that is building a human reference “pangenome” to replace the current reference (GRCh38). Our work in this project is focused on (1) building high-quality genome assemblies from several hundred ancestrally diverse individuals using long-read data, (2) characterizing the full extent of genome variation in these assemblies, (3) representing these assemblies and variants in a pangenome graph that can be used for downstream applications, and (4) building next-generation computational tools and pipelines that leverage these data structures to enable comprehensive and unbiased genomic analyses.
Genomic data science. In addition to the work described above, we are pursuing several projects at the intersection of human genetics and data science. Two key challenges for human genetics research are sample size and data sharing. We are developing and applying methods for data aggregation, sharing and cloud-based analysis. We previously developed the “functional equivalence” data processing standard to enable harmonized analysis across genome sequencing studies, which alleviates the strong batch effects that would otherwise confound joint analyses and is now in use at most large genome centers worldwide. We are core members of the AnVIL project, a multi-site collaboration that is building a cloud-based data sharing and analysis platform that will store and provide access to vast amounts of genomic data generated by NHGRI and other NIH institutes. We are also mining data from various national and international research projects and biobanks to increase sample size and provide replication for the human genetics studies described above, and to increase power for local biobanking and precision medicine efforts such as the Yale Generations Project.
Genetics; Genomics; Human Genetics; Data Science