Featured Publications
A framework for understanding selection bias in real-world healthcare data
Kundu R, Shi X, Morrison J, Barrett J, Mukherjee B. A framework for understanding selection bias in real-world healthcare data. Journal Of The Royal Statistical Society Series A (Statistics In Society) 2024, 187: 606-635. PMID: 39281782, PMCID: PMC11393555, DOI: 10.1093/jrsssa/qnae039.
Peer-Reviewed Original Research
Concepts: Electronic health records; Selection bias; Association of cancer; Multiple sources of bias; Health records; Healthcare system; Sources of bias; Real-world healthcare data; Binary outcomes; Estimation of associated parameters; Healthcare data; Real-world data; Potential bias; Sample size; Standard error; Data example; Variance formula; Analysis of real-world data; Association; Simulation study; Weighting approach; Biological sex; Associated parameters; Bias; Multiple sources
2024
Improving prediction of linear regression models by integrating external information from heterogeneous populations: James–Stein estimators
Han P, Li H, Park S, Mukherjee B, Taylor J. Improving prediction of linear regression models by integrating external information from heterogeneous populations: James–Stein estimators. Biometrics 2024, 80: ujae072. PMID: 39101548, PMCID: PMC11299067, DOI: 10.1093/biomtc/ujae072.
Peer-Reviewed Original Research
Concepts: James-Stein estimator; Linear regression models; Individual-level data; Comprehensive simulation study; Regression models; Numerical performance; Simulation study; Shrinkage method; Coefficient estimates; Predictive mean; Reduced model; Study population heterogeneity; Internal model; Estimation; Study population; Blood lead levels; International studies; Covariates; Patella bone; Published literature; Lead levels; External studies; Summary information; Population; Subsets
2023
An inverse probability weighted regression method that accounts for right‐censoring for causal inference with multiple treatments and a binary outcome
Yu Y, Zhang M, Mukherjee B. An inverse probability weighted regression method that accounts for right‐censoring for causal inference with multiple treatments and a binary outcome. Statistics In Medicine 2023, 42: 3699-3715. PMID: 37392070, DOI: 10.1002/sim.9826.
Peer-Reviewed Original Research
Concepts: Right censoring; Weighted score function; Causal treatment effects; Average treatment effect; Asymptotic properties; Censored component; Pre-specified time window; Estimation consistency; Robustness properties; Simulation study; Binary outcomes; Presence of confounders; Censoring; Scoring function; Inverse probability; Treatment effects; Estimation; Sources of bias; Inference; Letter C; Comparative effectiveness research; Treatment switch; Regression method; Logistic regression models; Insurance claims database
2022
Methods for large‐scale single mediator hypothesis testing: Possible choices and comparisons
Du J, Zhou X, Clark‐Boucher D, Hao W, Liu Y, Smith J, Mukherjee B. Methods for large‐scale single mediator hypothesis testing: Possible choices and comparisons. Genetic Epidemiology 2022, 47: 167-184. PMID: 36465006, PMCID: PMC10329872, DOI: 10.1002/gepi.22510.
Peer-Reviewed Original Research
Concepts: Null hypothesis; Test statistics; Mediation hypothesis testing; Composite null hypothesis; Hypothesis testing; Classes of methods; False positive rate; Alternative hypothesis; Simulation study; Hypothesis testing method; Continuous mediator; Reference distribution; Sobel test statistics; Continuous outcomes; Exposure-mediator interaction; Multi-Ethnic Study of Atherosclerosis; DNA methylation sites; Class; CRAN; Methylation sites
Incorporating family disease history and controlling case–control imbalance for population-based genetic association studies
Zhuang Y, Wolford B, Nam K, Bi W, Zhou W, Willer C, Mukherjee B, Lee S. Incorporating family disease history and controlling case–control imbalance for population-based genetic association studies. Bioinformatics 2022, 38: 4337-4343. PMID: 35876838, PMCID: PMC9477535, DOI: 10.1093/bioinformatics/btac459.
Peer-Reviewed Original Research
Concepts: Empirical saddlepoint approximation; Family disease history; Case-control imbalance; Saddlepoint approximation; Genome-wide association analysis; Population-based genetic association studies; Genetic association tests; Variant-phenotype associations; Disease history; Genetic association studies; Low detection power; Type I error inflation; Correlation of phenotypes; White British sample; Supplementary data; Association studies; Population-based biobanks; Increased phenotypic correlations; Korean Genome; Simulation study; Phenotype distribution; Phenotype; Association Test; Bioinformatics; Phenotypic correlations
2021
A comparison of parametric propensity score‐based methods for causal inference with multiple treatments and a binary outcome
Yu Y, Zhang M, Shi X, Caram M, Little R, Mukherjee B. A comparison of parametric propensity score‐based methods for causal inference with multiple treatments and a binary outcome. Statistics In Medicine 2021, 40: 1653-1677. PMID: 33462862, DOI: 10.1002/sim.8862.
Peer-Reviewed Original Research
Concepts: Comparative effectiveness research; Estimation of causal effects; Propensity score-based methods; Binary outcomes; Insurance networks; Causal effects; Propensity score methods; Propensity-based methods; Confounding bias; Continuous outcomes; Pharmacy claims; Effectiveness research; Observational study; Simulation study; Adverse outcomes; Propensity score; Emergency room
2020
Methods to Account for Uncertainty in Latent Class Assignments When Using Latent Classes as Predictors in Regression Models, with Application to Acculturation Strategy Measures.
Elliott M, Zhao Z, Mukherjee B, Kanaya A, Needham B. Methods to Account for Uncertainty in Latent Class Assignments When Using Latent Classes as Predictors in Regression Models, with Application to Acculturation Strategy Measures. Epidemiology 2020, 31: 194-204. PMID: 31809338, PMCID: PMC7480960, DOI: 10.1097/ede.0000000000001139.
Peer-Reviewed Original Research
Concepts: Measurement error model; Joint model; Regression parameters; Latent classes; Likelihood-based; Latent class model; Simulation study; Class model; Two-stage model; Class; Error model; Primary interest; Acculturation behaviors; Measurement error; South Asian immigrants; Latent class analysis; Asian immigrants; True class; Uncertainty; Class analysis; Estimation; Strategy measures
Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions
Zhang M, Yu Y, Wang S, Salvatore M, Fritsche L, He Z, Mukherjee B. Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions. Statistics In Medicine 2020, 39: 1675-1694. PMID: 32101638, DOI: 10.1002/sim.8505.
Peer-Reviewed Original Research
Concepts: Type I error rate; Type I error inflation; Independence assumption; Wald and score tests; Correct type I error rates; Sandwich variance estimator; Sandwich estimator; Score test; Variance estimation; Simulation study; Misspecification; Michigan Genomics Initiative; Statistical practice; Binary outcomes; Tested interactions; Empirical facts; Flexible model; Data model; Test of interaction; Biobank study; Inflation; Assumptions; Continuous outcomes; Epidemiological literature; Linear regression models
2019
A Fast and Accurate Method for Genome-wide Scale Phenome-wide G × E Analysis and Its Application to UK Biobank
Bi W, Zhao Z, Dey R, Fritsche L, Mukherjee B, Lee S. A Fast and Accurate Method for Genome-wide Scale Phenome-wide G × E Analysis and Its Application to UK Biobank. American Journal Of Human Genetics 2019, 105: 1182-1192. PMID: 31735295, PMCID: PMC6904814, DOI: 10.1016/j.ajhg.2019.10.008.
Peer-Reviewed Original Research
Concepts: Case-control ratio; Genome-wide significance level; Measures of environmental exposure; Genome-wide analysis; European ancestry samples; Genetic association studies; Saddlepoint approximation; Case-control imbalance; Analysis of phenotypes; Gene-environment interactions; Population-based biobanks; Controlled type I error rates; Association studies; G x E effects; UK Biobank; Type I error rate; Genetic variants; E analysis; SPAGE; Complex diseases; Environmental exposures; Test statistics; E study; Simulation study; Wald test
Estimating Outcome-Exposure Associations when Exposure Biomarker Detection Limits vary Across Batches.
Boss J, Mukherjee B, Ferguson K, Aker A, Alshawabkeh A, Cordero J, Meeker J, Kim S. Estimating Outcome-Exposure Associations when Exposure Biomarker Detection Limits vary Across Batches. Epidemiology 2019, 30: 746-755. PMID: 31299670, PMCID: PMC6677587, DOI: 10.1097/ede.0000000000001052.
Peer-Reviewed Original Research
Concepts: Binary outcome data; Likelihood-based methods; Complete-case analysis; Distributional assumptions; Assignment of samples; Superior estimation properties; Simulation study; Complete-case; Multiple imputation strategy; Exposure data; Multiple batches; Batch assignment; Estimated properties; Limit-variables; Single imputation; Multiple imputation; Cohort study
Synthetic data method to incorporate external information into a current study
Gu T, Taylor J, Cheng W, Mukherjee B. Synthetic data method to incorporate external information into a current study. Canadian Journal Of Statistics 2019, 47: 580-603. PMID: 32773922, PMCID: PMC7410329, DOI: 10.1002/cjs.11513.
Peer-Reviewed Original Research
Concepts: Synthetic data methods; Dataset of size; Synthetic data approach; B model; Maximum likelihood estimation approach; Asymptotic variance; General regression context; Size n; Regression context; Simulation study; Variable B; Estimation approach; Diverse scenarios; Cancer Prevention Trial; External information; Individual level data; Dataset; Prostate Cancer Prevention Trial; Prevention trials
2018
Distributed Lag Interaction Models with Two Pollutants
Chen Y, Mukherjee B, Berrocal V. Distributed Lag Interaction Models with Two Pollutants. Journal Of The Royal Statistical Society Series C (Applied Statistics) 2018, 68: 79-97. PMID: 30636815, PMCID: PMC6328049, DOI: 10.1111/rssc.12297.
Peer-Reviewed Original Research
Concepts: Mean square error; Effects of air pollution; Distributed lag models; Air pollution studies; Health outcomes; National Morbidity; Bias-variance tradeoff; Environmental epidemiology; Air pollution; Pollution studies; Pollution; Lag effect; Mortality counts; Main effect; Tensor product; Shrinkage method; Shrinkage version; Average performance; Natural way; Simulation study; Joint effects; Interaction structure; Mortality; NMMAPS; Morbidity
Empirical Bayes Estimation and Prediction Using Summary-Level Information From External Big Data Sources Adjusting for Violations of Transportability
Estes J, Mukherjee B, Taylor J. Empirical Bayes Estimation and Prediction Using Summary-Level Information From External Big Data Sources Adjusting for Violations of Transportability. Statistics In Biosciences 2018, 10: 568-586. PMID: 31123532, PMCID: PMC6529204, DOI: 10.1007/s12561-018-9217-4.
Peer-Reviewed Original Research
Concepts: Empirical Bayes estimators; Summary-level information; Constrained maximum likelihood; Bayes estimators; Empirical Bayes shrinkage estimators; Simulation study; Bayes shrinkage estimator; Shrinkage estimators; Likelihood estimation; Covariate distributions; Conditional probability distribution; Data applications; Trade bias; Maximum likelihood; Probability distribution; Loss of efficiency; Cancer Prevention Trial; Individual-level data; Estimation; Prostate Cancer Prevention Trial; Prevention trials; International population
Subset-Based Analysis Using Gene-Environment Interactions for Discovery of Genetic Associations across Multiple Studies or Phenotypes
Yu Y, Xia L, Lee S, Zhou X, Stringham H, Boehnke M, Mukherjee B. Subset-Based Analysis Using Gene-Environment Interactions for Discovery of Genetic Associations across Multiple Studies or Phenotypes. Human Heredity 2018, 83: 283-314. PMID: 31132756, PMCID: PMC7034441, DOI: 10.1159/000496867.
Peer-Reviewed Original Research
MeSH Keywords: Case-Control Studies; Cholesterol; Cohort Studies; Computer Simulation; C-Reactive Protein; Finland; Gene Frequency; Gene-Environment Interaction; Genetic Predisposition to Disease; Genome-Wide Association Study; Humans; Lipoproteins, LDL; Meta-Analysis as Topic; Models, Genetic; Phenotype; Polymorphism, Single Nucleotide
Concepts: Presence of G-E interactions; Genetic association; Heterogeneity of genetic effects; Discovery of genetic associations; Gene-environment (G-E; Marginal genetic effects; G-E interactions; Genome-wide association studies; Gene-environment interactions; Genetic effects; Data examples; Simulation study; Single nucleotide polymorphisms; Gene-environment; Association studies; Association analysis; Screening tool; Marginal association; Nucleotide polymorphisms; Presence of heterogeneity; Association; Environmental factors; Increased power; Multiple studies; G-E
2017
Robust distributed lag models using data adaptive shrinkage
Chen Y, Mukherjee B, Adar S, Berrocal V, Coull B. Robust distributed lag models using data adaptive shrinkage. Biostatistics 2017, 19: 461-478. PMID: 29040386, PMCID: PMC6454578, DOI: 10.1093/biostatistics/kxx041.
Peer-Reviewed Original Research
Concepts: Distributed lag models; Distributed Lag; Lag model; Time series data; Effects of air pollution; Bias-variance trade-off; Generalized ridge regression; Shrinkage method; Air pollution studies; Hierarchical Bayes approach; Shrinkage approach; Time series; Dl function; Air pollution; Pollution studies; Effect estimates; Trade-offs; Extensive simulation study; Dependent variable; Shrinking coefficients; Mean square error; Lag; Simulation study; Bayes approach; Ridge regression
Meta‐analysis of gene‐environment interaction exploiting gene‐environment independence across multiple case‐control studies
Estes J, Rice J, Li S, Stringham H, Boehnke M, Mukherjee B. Meta‐analysis of gene‐environment interaction exploiting gene‐environment independence across multiple case‐control studies. Statistics In Medicine 2017, 36: 3895-3909. PMID: 28744888, PMCID: PMC5624850, DOI: 10.1002/sim.7398.
Peer-Reviewed Original Research
MeSH Keywords: Age Factors; Alpha-Ketoglutarate-Dependent Dioxygenase FTO; Bayes Theorem; Bias; Biometry; Body Mass Index; Case-Control Studies; Computer Simulation; Diabetes Mellitus, Type 2; Gene-Environment Interaction; Humans; Logistic Models; Meta-Analysis as Topic; Models, Genetic; Models, Statistical; Polymorphism, Single Nucleotide; Retrospective Studies
Concepts: Gene-environment independence; Gene-environment; Empirical Bayes estimators; Gene-environment interactions; Case-control study; Meta-analysis setting; Bayes estimators; Retrospective likelihood framework; Shrinkage estimators; Meta-analysis; Testing gene-environment interactions; Combination of estimates; Factors body mass index; Simulation study; Body mass index; Unconstrained model; Likelihood framework; Inverse variance; Meta-analysis framework; FTO gene; Mass index; Genetic markers; Estimation; Standard alternative; Chatterjee
2016
A new variance component score test for testing distributed lag functions with applications in time series analysis
Chen Y, Mukherjee B. A new variance component score test for testing distributed lag functions with applications in time series analysis. Statistics & Probability Letters 2016, 123: 122-127. PMID: 29200542, PMCID: PMC5703603, DOI: 10.1016/j.spl.2016.12.003.
Peer-Reviewed Original Research
Tests for Gene-Environment Interactions and Joint Effects With Exposure Misclassification
Boonstra P, Mukherjee B, Gruber S, Ahn J, Schmit S, Chatterjee N. Tests for Gene-Environment Interactions and Joint Effects With Exposure Misclassification. American Journal Of Epidemiology 2016, 183: 237-247. PMID: 26755675, PMCID: PMC4724093, DOI: 10.1093/aje/kwv198.
Peer-Reviewed Original Research
Concepts: G-E interactions; Presence of exposure misclassification; Exposure misclassification; Impact of exposure misclassification; Gene-environment (G-E; Gene-environment interactions; Genome-wide level; Genome-wide search; Genome-wide testing; Genetic susceptibility loci; Joint test; Disease-gene relationships; Gene-environment; Genetic risk factors; Type I error rate; Family-wise type I error rate; Susceptibility loci; G-E; Genetic association; Risk factors; Statistical power; Joint effects; Simulation study; Misclassification; Published simulation studies
2014
A data-adaptive strategy for inverse weighted estimation of causal effects
Zhu Y, Ghosh D, Mitra N, Mukherjee B. A data-adaptive strategy for inverse weighted estimation of causal effects. Health Services And Outcomes Research Methodology 2014, 14: 69-91. DOI: 10.1007/s10742-014-0124-y.
Peer-Reviewed Original Research
Concepts: Estimation of causal effects; Data analysis examples; Average treatment effect; Nonparametric model; Simulation study; Theoretical results; Propensity score; Effect of confounders; Measured covariates; Weight estimation; Causal effects; Nonrandomized observational study; Treatment effects; Logistic regression; Observational study; Analysis example; Randomized trials; Confounding; Examples; Scores; Covariates; Inference; Estimation
2013
Bayesian shrinkage methods for partially observed data with many predictors
Boonstra P, Mukherjee B, Taylor J. Bayesian shrinkage methods for partially observed data with many predictors. The Annals Of Applied Statistics 2013, 7: 2272-2292. PMID: 24436727, PMCID: PMC3891514, DOI: 10.1214/13-aoas668.
Peer-Reviewed Original Research
Concepts: Fraction of missing information; Optimal bias-variance tradeoff; Bayesian shrinkage methods; Empirical Bayes algorithm; Comprehensive simulation study; Bias-variance tradeoff; Surrogate covariates; Simulation study; Shrinkage method; Covariates; Prediction problem; State-of-the-art; Model parameters; Problem; Missing data; Lung cancer dataset; Bayes algorithm; State-of-the-art technologies; Array technology; Cancer datasets; QRT-PCR