Hua Xu, PhD
Additional Titles
Vice Chair for Research and Development, Department of Biomedical Informatics and Data Science
Assistant Dean for Biomedical Informatics, Yale School of Medicine
Contact Info
Biomedical Informatics & Data Science
100 College St
New Haven, Connecticut 06510
United States
About
Titles
Robert T. McCluskey Professor of Biomedical Informatics and Data Science
Vice Chair for Research and Development, Department of Biomedical Informatics and Data Science; Assistant Dean for Biomedical Informatics, Yale School of Medicine
Biography
Dr. Hua Xu is a well-known researcher in clinical natural language processing (NLP). He has developed novel algorithms for important clinical NLP tasks such as entity recognition and relation extraction, which have been top-ranked in over a dozen international biomedical NLP challenges. His lab developed CLAMP, a comprehensive clinical NLP toolkit that has been successfully commercialized and is used by hundreds of healthcare organizations. He has also led multiple national and international initiatives (e.g., as Chair of the NLP working group of the Observational Health Data Sciences and Informatics (OHDSI) program) that apply NLP technologies to diverse clinical and translational studies, accelerating clinical evidence generation from electronic health record data. More recently, he has used NLP to harmonize the metadata of biomedical digital objects (e.g., indexing millions of biomedical datasets to make them findable), with the goal of promoting FAIR principles in biomedicine. Dr. Xu's lab is currently developing large language models for diverse biomedical applications. More information is available on Dr. Xu's lab website.
Appointments
Biomedical Informatics & Data Science
Professor, Primary
Education & Training
- PhD, Columbia University, Biomedical Informatics
- MS, New Jersey Institute of Technology, Computer Science
- BS, Nanjing University, Biochemistry
Research
ORCID
0000-0002-5274-4672
Lab Website
Clinical NLP Lab
Yale Co-Authors
Lucila Ohno-Machado, MD, MBA, PhD
Vipina K. Keloth, PhD
Qingyu Chen, PhD
Tsung-Ting Kuo, PhD
Huan He, PhD
Jihoon Kim, PhD
Research Interests
Natural Language Processing
Publications
2025
Medical foundation large language models for comprehensive text analysis and beyond
Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F, Peng X, Huang J, Zhang J, Keloth V, Zhou X, Qian L, He H, Shung D, Ohno-Machado L, Wu Y, Xu H, Bian J. Medical foundation large language models for comprehensive text analysis and beyond. npj Digital Medicine 2025, 8: 141. PMID: 40044845, PMCID: PMC11882967, DOI: 10.1038/s41746-025-01533-1. Peer-Reviewed Original Research.
Concepts: Text analysis tasks, Analysis tasks, Language model, Domain-specific knowledge, Zero-shot, Human evaluation, Supervised setting, Task-specific instructions, Clinical data sources, Specialized medical knowledge, ChatGPT, Text analysis, Pretraining, Task, Data sources, Medical applications, Medical knowledge, Enhanced performance, Text, Performance
Improving entity recognition using ensembles of deep learning and fine-tuned large language models: A case study on adverse event extraction from VAERS and social media
Li Y, Viswaroopan D, He W, Li J, Zuo X, Xu H, Tao C. Improving entity recognition using ensembles of deep learning and fine-tuned large language models: A case study on adverse event extraction from VAERS and social media. Journal of Biomedical Informatics 2025, 163: 104789. PMID: 39923968, DOI: 10.1016/j.jbi.2025.104789. Peer-Reviewed Original Research.
Concepts: Traditional deep learning models, Deep learning models, Recurrent neural network, Learning models, Entity recognition, Language model, F1 score, Ensemble of deep learning, Advances of natural language processing, Effectiveness of ensemble methods, Micro-averaged F1, Bidirectional Encoder Representations, Extensive labeled data, Natural language processing, Fine-tuned models, Biomedical text mining, Feature representation, Encoder Representations, Event extraction, Entity types, Text data, Deep learning, Sequential data, GPT-2, Neural network
Evaluating the Bias, type I error and statistical power of the prior Knowledge-Guided integrated likelihood estimation (PIE) for bias reduction in EHR based association studies
Jing N, Lu Y, Tong J, Weaver J, Ryan P, Xu H, Chen Y. Evaluating the Bias, type I error and statistical power of the prior Knowledge-Guided integrated likelihood estimation (PIE) for bias reduction in EHR based association studies. Journal of Biomedical Informatics 2025, 163: 104787. PMID: 39904407, DOI: 10.1016/j.jbi.2025.104787. Peer-Reviewed Original Research.
Concepts: Type I error, Integrated likelihood estimators, Electronic health records, Use-case analysis, Likelihood estimation, Low prevalence outcomes, Use-cases, Bias reduction, Naive method, Effect size, Synthetic data, Phenotyping algorithms, Estimation bias, Real-world scenarios, Statistical inference, Simulation study, Association effect sizes, Accurate prior information, Binary outcomes, Point estimates, Association estimates, Statistical power, Health records, Knowledge-guided, Outcome prevalence
Environment scan of generative AI infrastructure for clinical and translational science
Idnay B, Xu Z, Adams W, Adibuzzaman M, Anderson N, Bahroos N, Bell D, Bumgardner C, Campion T, Castro M, Cimino J, Cohen I, Dorr D, Elkin P, Fan J, Ferris T, Foran D, Hanauer D, Hogarth M, Huang K, Kalpathy-Cramer J, Kandpal M, Karnik N, Katoch A, Lai A, Lambert C, Li L, Lindsell C, Liu J, Lu Z, Luo Y, McGarvey P, Mendonca E, Mirhaji P, Murphy S, Osborne J, Paschalidis I, Harris P, Prior F, Shaheen N, Shara N, Sim I, Tachinardi U, Waitman L, Wright R, Zai A, Zheng K, Lee S, Malin B, Natarajan K, Price II W, Zhang R, Zhang Y, Xu H, Bian J, Weng C, Peng Y. Environment scan of generative AI infrastructure for clinical and translational science. npj Health Systems 2025, 2: 4. PMID: 39872195, PMCID: PMC11762411, DOI: 10.1038/s44401-024-00009-w. Peer-Reviewed Original Research.
Concepts: Information technology staff, Data security, Generative AI, Clinician trust, Technology staff, AI bias, AI infrastructure, Environment scanning, National Institutes of Health, National Center for Advancing Translational Sciences, Comprehensive environmental scan, CTSA program, Infrastructure, Translational science, Security, Deployment, Institutes of Health, Network, Environmental scan, Coordinated approach
BiomedRAG: A retrieval augmented large language model for biomedicine
Li M, Kilicoglu H, Xu H, Zhang R. BiomedRAG: A retrieval augmented large language model for biomedicine. Journal of Biomedical Informatics 2025, 162: 104769. PMID: 39814274, PMCID: PMC11837810, DOI: 10.1016/j.jbi.2024.104769. Peer-Reviewed Original Research.
2024
OncoSplicing 3.0: an updated database for identifying RBPs regulating alternative splicing events in cancers
Zhang Y, Liu K, Xu Z, Li B, Wu X, Fan R, Yao X, Wu H, Duan C, Gong Y, Chen K, Zeng J, Li L, Xu H. OncoSplicing 3.0: an updated database for identifying RBPs regulating alternative splicing events in cancers. Nucleic Acids Research 2024, 53: D1460-D1466. PMID: 39558172, PMCID: PMC11701682, DOI: 10.1093/nar/gkae1098. Peer-Reviewed Original Research.
Concepts: RNA-binding proteins, Alternative splicing events, AS events, Splicing events, Alternative splicing, Potential RNA-binding proteins, Regulate alternative splicing events, TCGA cancers, RNA-binding motif, RNA-seq data, Regulate gene expression, mRNA expression data, eCLIP-seq, GTEx tissues, ENCODE project, Abnormal alternative splicing, Intron sequences, Splicing analysis, RNA-seq, Expression data, Protein complexes, Minigene constructs, Splicing, Gene expression, Perturbation experiments
Towards Enhanced Topic Discovery on Semantic Maps for Biomedical Literature Exploration
Choi B, Ondov B, He H, Xu H. Towards Enhanced Topic Discovery on Semantic Maps for Biomedical Literature Exploration. 2024, 00: 25-27. DOI: 10.1109/vahc65315.2024.00015. Peer-Reviewed Original Research.
Concepts: Semantic map, Hierarchical topic model, TF-IDF method, Centroid-based method, Hierarchical topics, Topic discovery, Growth of biomedical research, Label generation, Topic tree, Topic models, Overwhelming volume, Novel method, Hierarchical clustering, Public distribution, Literature exploration, Semantics, Maps, Enhanced visualization, HDBSCAN, Topics, Labeling, Method, Representation, Visualization, Volume of literature
Sirtuin1 Suppresses Calcium Oxalate Nephropathy via Inhibition of Renal Proximal Tubular Cell Ferroptosis Through PGC‐1α‐mediated Transcriptional Coactivation
Duan C, Li B, Liu H, Zhang Y, Yao X, Liu K, Wu X, Mao X, Wu H, Xu Z, Zhong Y, Hu Z, Gong Y, Xu H. Sirtuin1 Suppresses Calcium Oxalate Nephropathy via Inhibition of Renal Proximal Tubular Cell Ferroptosis Through PGC‐1α‐mediated Transcriptional Coactivation. Advanced Science 2024, 11: 2408945. PMID: 39498889, PMCID: PMC11672264, DOI: 10.1002/advs.202408945. Peer-Reviewed Original Research.
Concepts: Crystal-induced kidney injury, PGC-1α, Single-cell transcriptome sequencing, Nuclear factor erythroid 2-related factor 2, Resistance to ferroptosis, Kidney injury, Transcriptional coactivator, Transcriptome sequencing, Renal tubular epithelial cell injury, Calcium oxalate nephropathy, Promoter region, Renal proximal tubular cells, Tubular epithelial cell injury, Epithelial cell injury, Proximal tubular cells, Factor erythroid 2-related factor 2, Erythroid 2-related factor 2, Oxalate nephropathy, Cell ferroptosis, SIRT1, Crystal nephropathy, Ferroptosis, Tubular cells, GPX4 transcription, Therapeutic target
SEETrials: Leveraging large language models for safety and efficacy extraction in oncology clinical trials
Lee K, Paek H, Huang L, Hilton C, Datta S, Higashi J, Ofoegbu N, Wang J, Rubinstein S, Cowan A, Kwok M, Warner J, Xu H, Wang X. SEETrials: Leveraging large language models for safety and efficacy extraction in oncology clinical trials. Informatics in Medicine Unlocked 2024, 50: 101589. PMID: 39493413, PMCID: PMC11530223, DOI: 10.1016/j.imu.2024.101589. Peer-Reviewed Original Research.
Concepts: Antibody-drug conjugates, Overall response rate, Multiple myeloma, F1 score, CAR-T, Complete response, Bispecific antibodies, Comparative performance analysis, Clinical trial study, Clinical trial outcomes, Language model, Accurate data extraction, Therapy subgroup, Fine granularity, Oncology clinical trials, Adverse events, Clinical decision-making, Performance analysis, Clinical trials, Innovative therapies, Diverse therapies, Clinical trial abstracts, Cancer domain, Data elements, Therapy
Improving tabular data extraction in scanned laboratory reports using deep learning models
Li Y, Wei Q, Chen X, Li J, Tao C, Xu H. Improving tabular data extraction in scanned laboratory reports using deep learning models. Journal of Biomedical Informatics 2024, 159: 104735. PMID: 39393477, DOI: 10.1016/j.jbi.2024.104735. Peer-Reviewed Original Research.
Concepts: Tree edit distance, Optical character recognition, Table recognition, Deep learning models, Average recall, Average precision, State-of-the-art deep learning models, Learning models, Region-of-interest detection, State-of-the-art, Character recognition, Detection evaluation, Tree editing, Tabular data, Impressive results, Lab test results, Laboratory test reports, Clinical documentation, Recognition, Laboratory reports, Healthcare organizations, Clinical data analysis, Decision making, Clinical decision making, Test reports
News
- March 18, 2025
Connecticut Academy of Science and Engineering Elects 12 From YSM
- February 28, 2025
Yale Researchers Use Large Language Models to Detect Gastrointestinal Bleeding
- February 27, 2025
Expanded Access to MarketScan Data for Yale Researchers
- December 11, 2024 (Source: WFSB)
Researchers at Yale Working with A.I. to Make Getting a Second Opinion Easier