# Jeffrey Townsend, Ph.D.

Associate Professor of Public Health (Biostatistics) and of Ecology and Evolutionary Biology; Director of Bioinformatics, Yale Center for Analytical Sciences

### Current Projects

The lab has many ongoing projects across the topics summarized below; some have already led to publications and many have not. In particular, much of our current work on the somatic evolution of cancer is not yet published.

### Research Summary

**1. BIOINFORMATIC TOOLS FOR CANCER GENETICS AND EPIDEMIOLOGY**

Whole-exome sequencing has created tremendous potential for revealing the genetic basis and underlying molecular mechanisms of many forms of cancer. However, somatic mutations occur at a significant frequency within tumors of most cancer types, and identification of the mutations that are on the causative trajectory from normal tissue to cancerous tissue is challenging. We are making algorithmic advances that enable maximum likelihood inference of model-averaged clustering of somatic amino acid replacement mutations along the discrete linear sequences of mutated genes, and we are applying evolutionary theory to the repeated evolution of cancer in whole-exome sequence data sets to reveal the level of clonal natural selection for cancer drivers.

**2. BIOSTATISTICAL ANALYSIS FOR NONLINEAR MATHEMATICAL MODELS OF THE EPIDEMIOLOGY OF DISEASE**

I am developing probabilistic statistical methodologies for the mathematical modeling of disease emergence and spread. For diverse reasons, data for estimation of epidemiological parameters is often sparse. Evaluating a model with the “best point estimate” of sparse data may convey a misleading certitude to policy makers basing decisions on deterministic models of disease outbreak, spread, and persistence. Conversely, policy makers who are aware that models are parameterized with limited data may be dismissive of deterministic predictions that yet have significant validity. We address these issues by probabilistic sensitivity analysis of parameters and full uncertainty analysis of outcomes of interest.

### Extensive Research Description

**1. TOOLS FOR CANCER GENETICS AND EPIDEMIOLOGY**

Whole-exome sequencing has created tremendous potential for revealing the genetic basis and underlying molecular mechanisms of many forms of cancer. However, somatic mutations occur at a significant frequency within tumors of most cancer types, and identification of the mutations that are on the causative trajectory from normal tissue to cancerous tissue is challenging. We are making algorithmic advances in clustering across discrete linear sequences to enact two powerful approaches to this identification. First, we are applying maximum likelihood approaches that we have developed for model-averaged clustering in discrete linear sequences to somatic amino acid replacement mutations appearing within mutated genes. Because amino acids of proteins that are functionally important are locally clustered in domains, mutations in multiple tumors that are functionally important to the development of cancer cluster in the linear sequence of relevant genes, allowing inference of relevance and function even in cases without three-dimensional protein structure. These clustering analyses have the power to demonstrate, for instance, cross-cancer consistency in the functional importance of the DNA binding domain of tumor suppressor p53, whether in a cancer with extensive exome data (ovarian serous adenocarcinoma) or in a cancer with much less extensive exome data (e.g. rectal adenocarcinoma).

Second, we are applying evolutionary theory to the problem of identifying the genetic architecture underlying cancer development. The path from normal to cancerous tissue is navigated by an evolutionary process. Tools from evolutionary theory have the potential to parse those mutations that are selected within cells on the path to cancer from those mutations that arise incidentally during the somatic evolution of cancer. The theory we are applying makes use of differences in expectation for synonymous and replacement mutations. Synonymous mutations are expected to have no functional impact; thus they yield a proxy expectation for the “incidental” mutations, whereas carcinogenic replacement mutations are expected to spread within tumors more frequently and to cluster within the gene sequence. Our theory also employs human population polymorphism data, which most evolutionary biologists believe can be largely assumed to be neutral. These data facilitate calibration of the probable impact of replacement changes against sequence conservation by eliminating the confounding variable of the degree of purifying selection, which decreases the number of mutations observed in some genes and allows others to accumulate many mutations with little impact.
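As a rough illustration of how synonymous mutations can serve as a proxy for incidental mutation, the sketch below computes a textbook dN/dS-style ratio: the per-site rate of replacement mutations divided by the per-site rate of synonymous mutations. It is a simplified stand-in and not the estimator used in the lab; the function name, the mutation counts, and the site totals are all hypothetical.

```python
def replacement_to_synonymous_ratio(n_replacement, n_synonymous,
                                    replacement_sites, synonymous_sites):
    """Per-site replacement rate divided by per-site synonymous rate.

    Values near 1 are consistent with incidental (neutral) mutation,
    values well above 1 suggest selection favoring replacement changes,
    and values well below 1 suggest purifying selection.
    """
    if n_synonymous == 0 or synonymous_sites == 0 or replacement_sites == 0:
        raise ValueError("ratio is undefined without synonymous mutations and sites")
    replacement_rate = n_replacement / replacement_sites
    synonymous_rate = n_synonymous / synonymous_sites
    return replacement_rate / synonymous_rate


# Hypothetical counts for one gene pooled across tumors (illustrative only).
ratio = replacement_to_synonymous_ratio(
    n_replacement=45, n_synonymous=5,
    replacement_sites=900, synonymous_sites=300,
)
print(f"replacement/synonymous ratio: {ratio:.2f}")
```

A ratio like this ignores mutational context and the calibration against population polymorphism described above, so it should be read only as a first-pass intuition for the comparison.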

We are extending this approach to estimating selection intensity on mutations along the trajectory toward cancer, revealing the level of selection within tumors for replacement mutations compared to synonymous mutations. This evolutionary analysis is ideal for detecting the history of selection on sites within genes during the evolution of cancer from exome sequencing data. These sites, particularly when they represent gain-of-function mutations, will help identify candidate loci for pharmacological intervention and inform the design of “personal genomics” drugs appropriate for the genetics of individual cancers in individual patients. As a component of that project, we are constructing an “active-experiment” cancer exome database to facilitate further bioinformatics investigation of cancer exome data.

**2. BIOSTATISTICAL ANALYSIS FOR NONLINEAR MATHEMATICAL MODELS OF THE EPIDEMIOLOGY OF DISEASE**

I am developing probabilistic statistical methodologies for the mathematical modeling of disease emergence and spread. Robustness of models has usually been assessed by techniques that explore the relative impact and importance of parameters upon the mathematical behavior of the function and the mathematical predictions of the model. For diverse reasons including the difficulty or cost of acquisition, restrictions due to privacy, and urgency of analysis in the case of outbreaks, data for estimation of epidemiological parameters is often sparse. Evaluating a model with the “best point estimate” of sparse data may convey a misleading certitude to policy makers basing decisions on deterministic models of disease outbreak, spread, and persistence. Conversely, policy makers who are aware that models are parameterized with limited data may be dismissive of deterministic predictions that yet have significant validity. These issues may be most straightforwardly addressed by probabilistic sensitivity analysis of parameters and full uncertainty analysis of outcomes of interest. These analyses incorporate parameter uncertainty directly by probabilistically resampling either the data or plausible distributions of the parameters, yielding a probability distribution of outcomes.
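To make the idea of resampling sparse data concrete, here is a minimal sketch that bootstraps a small set of observed infectious periods and carries each resample through to an estimate of the basic reproductive number. It assumes a simple SIR-type relation in which that number equals the transmission rate times the mean infectious period; the observations, the transmission rate `beta`, and all function names are hypothetical.

```python
import random
import statistics


def bootstrap_r0(durations, beta, n_resamples=2000, seed=1):
    """Return sorted bootstrap estimates of R0 = beta * mean infectious period."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample the observations with replacement, keeping the sample size.
        resample = rng.choices(durations, k=len(durations))
        estimates.append(beta * statistics.mean(resample))
    estimates.sort()
    return estimates


def percentile(sorted_values, fraction):
    """Simple empirical percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


# Hypothetical sparse data: six observed infectious periods, in days.
infectious_days = [3.0, 4.5, 5.0, 6.0, 7.5, 9.0]
beta = 0.3  # hypothetical transmission rate per day

point_estimate = beta * statistics.mean(infectious_days)
r0_samples = bootstrap_r0(infectious_days, beta)
lower = percentile(r0_samples, 0.025)
upper = percentile(r0_samples, 0.975)
prob_above_one = sum(r > 1.0 for r in r0_samples) / len(r0_samples)

print(f"point estimate: {point_estimate:.2f}")
print(f"95% bootstrap interval: {lower:.2f} to {upper:.2f}")
print(f"fraction of resamples with R0 > 1: {prob_above_one:.3f}")
```

The point estimate alone hides the spread that the interval and the exceedance fraction make visible, which is the contrast the paragraph above draws.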

For instance, one of the most common modeling approaches for evaluating interventions is based on differential equation models of disease such as the standard Susceptible-Infected-Recovered (SIR) model. In the SIR model and other more complex constructions, a closed-form solution can often be calculated for the basic reproductive number, $R_0$, the average number of secondary infections that would follow upon a primary infection in a naïve host population. In a population where there is preexisting immunity due to either vaccination or previous infection, the effective reproductive number, $R_e$, is defined as the average number of secondary infections following a primary infection in a population that is not completely naïve. $R_e$ is of particular interest in public health because interventions that bring its value below 1 are predicted to eradicate the disease. This deterministic threshold of $R_e$ is proposed as the basis for policy decisions regarding the level of interventions that should be implemented. However, the best estimates for the parameters that are needed for the closed-form solution of $R_e$ are inevitably inexact. To address this point, sensitivity analyses are frequently performed to evaluate models and explore the relationship between model parameters and outcomes. In such deterministic sensitivity analyses, one or more parameters are perturbed and the corresponding effects on outcomes are examined. The perturbation can be done either by evaluating the effect of arbitrarily small changes in parameter values (e.g. ± 1%) or by evaluating the effects across a range of values defined by plausible probability density functions. Because the values of other parameters are held fixed at best point estimates, these strategies do not account for interaction effects in non-linear dynamic models, and do not assess global uncertainty in outcome.

Uncertainty analysis has been recommended for many fields of mathematical modeling, including medical decision making, as an optimal approach to presenting models. In the case of dynamic transmission modeling, however, authoritative best practices have not included uncertainty analyses. Modeling guidelines recommend probabilistic sensitivity analysis, in which both global parameter uncertainty and output uncertainty are addressed, as the best practice method for uncertainty analysis. Yet that ideal has not been extended to dynamic transmission models, for which its implementation has been challenging.
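The one-at-a-time perturbation described above can be sketched as follows. The example assumes the simplest textbook case: an SIR model with transmission rate `beta` and recovery rate `gamma`, so that the basic reproductive number is `beta / gamma`, and a fully protective vaccine given to a fraction `coverage` of the population, so that the effective reproductive number is `(beta / gamma) * (1 - coverage)`. The baseline values and function names are hypothetical.

```python
def effective_reproductive_number(beta, gamma, coverage):
    """R_e for a simple SIR model with a fully protective vaccine (assumed form)."""
    return (beta / gamma) * (1.0 - coverage)


def one_at_a_time_sensitivity(baseline, relative_step=0.01):
    """Perturb each parameter by +1% while holding the others at baseline.

    Returns the baseline R_e and the relative change in R_e per parameter.
    """
    base_value = effective_reproductive_number(**baseline)
    changes = {}
    for name in baseline:
        perturbed = dict(baseline)
        perturbed[name] = baseline[name] * (1.0 + relative_step)
        new_value = effective_reproductive_number(**perturbed)
        changes[name] = (new_value - base_value) / base_value
    return base_value, changes


# Hypothetical best point estimates.
baseline = {"beta": 0.4, "gamma": 0.2, "coverage": 0.3}
base_value, changes = one_at_a_time_sensitivity(baseline)

print(f"baseline R_e: {base_value:.2f}")
for name, change in changes.items():
    print(f"+1% in {name}: {100 * change:+.2f}% change in R_e")
```

Because every other parameter stays pinned at its point estimate, the result says nothing about how the parameters interact or how uncertain the outcome is overall, which is the limitation noted in the text.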

We are developing methods for global probabilistic sensitivity analysis that allow the contribution of each parameter to model outcomes to be investigated while also taking into account the uncertainty of other model parameters. Uncertainty in parameter values can be accounted for by sampling randomly from empirical data or from probability density functions fit to empirical data. Depending on the instance, such sampling techniques include bootstrapping, Monte Carlo sampling, and Latin hypercube sampling. The model output generated from parameter samples can then be analyzed using linear (e.g. partial correlation coefficients), monotonic (e.g. partial rank correlation coefficients) and non-monotonic statistical tests (e.g. sensitivity index) to determine the contribution of each parameter to the variation in output values. Indeed, for a global sensitivity analysis to yield probabilities associated with outcomes that are of greatest utility to policy makers, probabilistic analyses of parameter uncertainty must be carried through to the model outcomes. For example, the probability of eradication of an epidemic is sensitive to both levels of vaccination and treatment. Moreover, a policy based on the analysis of data should take into consideration not only the best estimate of necessary action, but also the uncertainty around that outcome estimate. The former kind of policy advice, which indicates an exact cline of treatment and vaccination that should put an influenza epidemic into abeyance, is very different from the probabilistic statement, which gives a policymaker a predictive probability that a particular policy of treatment and vaccination will put the epidemic into abeyance, and it can be misleading by comparison.

Similar approaches, applied with a next-generation matrix to rabies vaccination in Tanzania, demonstrated that in two districts the WHO goal of 70% vaccination coverage of dogs had a more than sufficient probability of controlling rabies, if only the effort to achieve that not impractical goal could be mustered.
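A global analysis of the kind outlined above can be sketched by combining Latin hypercube sampling with partial rank correlation coefficients (PRCC). The example below is a minimal illustration and not the lab's implementation: it assumes the simple outcome `R_e = (beta / gamma) * (1 - coverage)`, assumes independent uniform ranges for each parameter, and computes partial correlations with the standard recursive formula. All ranges and names are hypothetical.

```python
import math
import random


def latin_hypercube(n_samples, ranges, rng):
    """Draw one point per equal-width stratum for each parameter, then shuffle."""
    columns = {}
    for name, (low, high) in ranges.items():
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(unit)
        columns[name] = [low + u * (high - low) for u in unit]
    return columns


def ranks(values):
    """Rank transform (ties are not expected for continuous samples)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, controls):
    """Correlation of x and y after removing the linear effect of the controls."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)
# Hypothetical plausible ranges for each parameter.
ranges = {"beta": (0.2, 0.6), "gamma": (0.1, 0.3), "coverage": (0.0, 0.8)}
samples = latin_hypercube(500, ranges, rng)
outputs = [
    (b / g) * (1.0 - v)
    for b, g, v in zip(samples["beta"], samples["gamma"], samples["coverage"])
]

# PRCC: partial correlation computed on ranks instead of raw values.
ranked = {name: ranks(column) for name, column in samples.items()}
ranked_outputs = ranks(outputs)
prcc = {}
for name in ranges:
    others = [ranked[other] for other in ranges if other != name]
    prcc[name] = partial_corr(ranked[name], ranked_outputs, others)

for name, value in prcc.items():
    print(f"PRCC of {name} with R_e: {value:+.2f}")
```

Working on ranks makes the coefficients suitable for monotonic but nonlinear relationships, which is why PRCC appears in the text as the monotonic option; a non-monotonic relationship would call for a different index.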

A public health decision maker would find most useful the assignment of a probability of eradication to each level of treatment, so that they may precisely weigh the cost of intervention against the potential for failure. These probabilistic outcome distributions also feed directly into cost-effectiveness estimation, a field that has embraced uncertainty analysis but that, until our recent work, had not incorporated uncertainty from nonlinear infectious disease models into its calculations.
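Assigning a probability of eradication to each intervention level can be sketched with a small Monte Carlo calculation. The example assumes the same simplified outcome used for illustration elsewhere, `R_e = (beta / gamma) * (1 - coverage)`, with eradication predicted when `R_e < 1`; the uniform parameter distributions, the coverage levels, and the function name are hypothetical.

```python
import random


def eradication_probability(coverage, parameter_draws):
    """Fraction of parameter draws for which R_e falls below 1 at this coverage."""
    below_threshold = sum(
        (beta / gamma) * (1.0 - coverage) < 1.0
        for beta, gamma in parameter_draws
    )
    return below_threshold / len(parameter_draws)


rng = random.Random(7)
# Hypothetical parameter uncertainty: independent uniform distributions.
parameter_draws = [
    (rng.uniform(0.3, 0.5), rng.uniform(0.15, 0.25)) for _ in range(5000)
]

coverage_levels = [0.0, 0.2, 0.4, 0.6, 0.8]
probabilities = {
    coverage: eradication_probability(coverage, parameter_draws)
    for coverage in coverage_levels
}

for coverage, probability in probabilities.items():
    print(f"coverage {coverage:.0%}: probability of eradication {probability:.3f}")
```

Reusing the same parameter draws at every coverage level keeps the comparison between levels free of sampling noise, and the resulting table of probabilities is the kind of output that could be passed on to a cost-effectiveness calculation.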

### Selected Publications

- Scarpino SV, Iamarino A, Wells C, Yamin D, Ndeffo-Mbah M, Wenzel NS, Fox SJ, Nyenswah T, Altice FL, Galvani AP, Meyers LA and Townsend JP: Epidemiological and viral genomic sequence analysis of the 2014 Ebola outbreak reveals clustered transmission. Clin Infect Dis. 2015 Apr 1;60(7):1079-82. Epub 2014 Dec 15. PMID: 25516185
- Gilbert, J.A., L.A. Meyers, A.P. Galvani, and J.P. Townsend (2014). Probabilistic uncertainty analysis of epidemiological modeling to guide public health intervention policy. Epidemics 6: 37-45.
- McBride, R. C., Boucher, N., Park, D. S., Turner, P. E., & Townsend, J. P. (2013). Yeast response to LA virus indicates coadapted global gene expression during mycoviral infection. FEMS yeast research, 13(2), 162-179.
- Townsend J.P., Z. Su, and Y.I. Tekle (2012). Phylogenetic signal and noise: predicting the power of a dataset to resolve phylogeny. Systematic Biology 61(5): 835-849.
- Townsend, J. P., Bøhn, T., & Nielsen, K. M. (2012). Assessing the probability of detection of horizontal gene transfer events in bacterial populations. Frontiers in microbiology, 3.
- Tekle, Y. I., Nielsen, K. M., Liu, J., Pettigrew, M. M., Meyers, L. A., Galvani, A. P., & Townsend, J. P. (2012). Controlling Antimicrobial Resistance through Targeted, Vaccine-Induced Replacement of Strains. PloS one, 7(12), e50688.
- Zhang, Z., López-Giráldez, F., & Townsend, J. P. (2010). LOX: inferring Level Of eXpression from diverse methods of census sequencing. Bioinformatics, 26(15), 1918-1919.
- Zhang, Z., & Townsend, J. P. (2009). Maximum-likelihood model averaging to profile clustering of site types across discrete linear sequences. PLoS computational biology, 5(6), e1000421.
- Zhang, Z., Cheung, K. H., & Townsend, J. P. (2009). Bringing Web 2.0 to bioinformatics. Briefings in bioinformatics, 10(1), 1-10.
- Townsend, J. P., & Hartl, D. L. (2002). Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple strains or treatments. Genome Biol, 3(12), 0071.