A Mixture Model Approach for the Analysis of Microarray Gene Expression Data
B. Allison, Gary L. Gadbury, Moonseong Heo, José R. Fernández, Cheol-Koo Lee, Tomas A. Prolla, Richard Weindruch. (2002). Computational Statistics & Data Analysis (in press).
For each gene, we often wish to answer the question "Is the true difference in expression between two or more groups statistically significant?" By "true difference" we mean an average difference in a gene's expression level between the populations from which the sample groups are obtained. However, there are several other important and interesting questions, including: Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the groups? What is the best estimate of the number of genes for which there is a true difference in expression? What is the confidence interval around that estimate? If we set some threshold above which we declare results for a particular gene 'interesting' and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression, and what proportion are likely to be false leads? What proportion of those genes not declared 'interesting' are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)?
We have developed procedures that allow formal answers to the questions posed above. This set of procedures is based on the idea that when many statistical tests are conducted, one obtains a distribution of test statistics and corresponding p-values, and that there is exploitable information in this distribution. If there is no mean difference in expression between two groups for any gene, the p-values should appear uniformly distributed on the interval [0,1], as long as the p-values can be considered independent. In contrast, if there is a true mean difference for some genes, the p-values will tend to cluster closer to zero than to one. By fitting a mixture of a uniform distribution and one or more beta distributions (the beta is simply a very flexible distribution used for convenience) to the observed collection of p-values, we can determine whether a mixture model that includes components beyond a uniform distribution fits the data significantly better than a single uniform distribution. If it does, we can answer "yes" to the question "Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the groups?" and proceed to estimate the number of such genes. Significance testing and construction of confidence intervals are performed via non-parametric bootstrap or simulation procedures. The effect of non-independence of p-values can be assessed via the same procedures, primarily via simulation studies. Once the model is fitted, we can use it to answer additional questions such as "For each gene, given the p-value computed for that gene, what is the probability that there is a true difference in its expression?" The answer is a posterior probability, computed using Bayes' rule.
Again, by "true difference" we mean a difference at the population level as opposed to simply an observed difference in a sample that could be due to either measurement error or sampling variability.
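The mixture idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a single beta component alongside the uniform, fits by maximum likelihood with SciPy's L-BFGS-B optimizer (a choice made here for convenience), and runs on simulated p-values. The function names (fit_mixture, posterior_true_difference), the starting values, the parameter bounds, and the simulated data are all hypothetical. The bootstrap-based significance test and confidence intervals mentioned in the text are not shown.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta


def neg_log_lik(params, p):
    """Negative log-likelihood of a uniform + one-beta mixture of p-values."""
    lam0, a, b = params  # lam0 = weight of the uniform ("no difference") part
    dens = lam0 * 1.0 + (1.0 - lam0) * beta.pdf(p, a, b)
    return -np.sum(np.log(dens))


def fit_mixture(p):
    """Maximum-likelihood fit; returns (lam0, a, b, log_lik)."""
    p = np.clip(p, 1e-10, 1 - 1e-10)  # keep the beta density finite at 0 and 1
    res = minimize(
        neg_log_lik,
        x0=[0.8, 0.5, 2.0],  # assumed starting values
        args=(p,),
        method="L-BFGS-B",
        bounds=[(1e-6, 1 - 1e-6), (1e-3, 50.0), (1e-3, 50.0)],  # assumed bounds
    )
    lam0, a, b = res.x
    return lam0, a, b, -res.fun


def posterior_true_difference(p, lam0, a, b):
    """Bayes' rule: P(gene belongs to the beta component | its p-value)."""
    p = np.clip(p, 1e-10, 1 - 1e-10)
    alt = (1.0 - lam0) * beta.pdf(p, a, b)
    return alt / (lam0 + alt)


# Simulated data (an assumption for illustration only): 800 genes with no
# difference (uniform p-values) and 200 with p-values clustered near zero.
rng = np.random.default_rng(0)
p_values = np.concatenate([rng.uniform(size=800),
                           rng.beta(0.3, 4.0, size=200)])

lam0, a, b, log_lik = fit_mixture(p_values)

# A single uniform has density 1 everywhere, so its log-likelihood is 0;
# log_lik is therefore the gain of the mixture over the uniform-only model.
n_diff = (1.0 - lam0) * p_values.size  # estimated number of differing genes

# Posterior probabilities for a small and a moderate p-value.
post = posterior_true_difference(np.array([0.001, 0.5]), lam0, a, b)
print(f"uniform weight: {lam0:.3f}, estimated genes with a difference: {n_diff:.0f}")
print(f"log-likelihood gain over uniform: {log_lik:.1f}")
print("posterior at p=0.001 and p=0.5:", post)
```

In this sketch the log-likelihood gain is only a descriptive measure of fit. Deciding whether the gain is statistically significant, and placing a confidence interval around the estimated number of genes, would require resampling along the lines of the bootstrap or simulation procedures described above.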