Hua Xu, PhD
Appointments
Additional Titles
Vice Chair for Research and Development, Department of Biomedical Informatics and Data Science
Assistant Dean for Biomedical Informatics, Yale School of Medicine
Contact Info
Biomedical Informatics & Data Science
100 College St
New Haven, Connecticut 06510
United States
About
Titles
Robert T. McCluskey Professor of Biomedical Informatics and Data Science
Vice Chair for Research and Development, Department of Biomedical Informatics and Data Science; Assistant Dean for Biomedical Informatics, Yale School of Medicine
Biography
Dr. Hua Xu is a well-known researcher in clinical natural language processing (NLP). He has developed novel algorithms for important clinical NLP tasks such as entity recognition and relation extraction, which have ranked at the top of more than a dozen international biomedical NLP challenges. His lab developed CLAMP, a comprehensive clinical NLP toolkit that has been successfully commercialized and is used by hundreds of healthcare organizations. He has also led multiple national and international initiatives (e.g., chairing the NLP working group of the Observational Health Data Sciences and Informatics (OHDSI) program) that apply NLP technologies to diverse clinical and translational studies, accelerating clinical evidence generation from electronic health record data. More recently, he has used NLP to harmonize metadata of biomedical digital objects (e.g., indexing millions of biomedical datasets to make them findable), with the goal of promoting FAIR principles in biomedicine. Dr. Xu's lab is currently developing large language models (LLMs) for diverse biomedical applications. More information is available on the Clinical NLP Lab website.
Appointments
Biomedical Informatics & Data Science
Professor (Primary)
Education & Training
- PhD, Columbia University, Biomedical Informatics
- MS, New Jersey Institute of Technology, Computer Science
- BS, Nanjing University, Biochemistry
Research
Overview
Medical Research Interests
ORCID: 0000-0002-5274-4672
Lab Website: Clinical NLP Lab
Research at a Glance
Yale Co-Authors
- Lucila Ohno-Machado, MD, MBA, PhD
- Vipina K. Keloth, PhD
- Qingyu Chen, PhD
- Tsung-Ting Kuo, PhD
- Huan He, PhD
- Jihoon Kim, PhD
Research Interests
- Natural Language Processing
Publications
Featured Publications
Benchmarking large language models for biomedical natural language processing applications and recommendations
Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer M, Ai X, Lai P, Wang Z, Keloth V, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng W, Adelman R, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nature Communications 2025, 16: 3280. PMID: 40188094, PMCID: PMC11972378, DOI: 10.1038/s41467-025-56989-2.
Peer-Reviewed Original Research
Concepts: Language model; Natural language processing applications; Biomedical natural language processing; Medical question answering; Language processing applications; Natural language processing; Growth of biomedical literature; Missing information; Few-shot; Question answering; Zero-Shot; Knowledge curation; Language processing; Processing applications; BioNLP; BART model; Performance gap; Biomedical literature; General domain; Task; Benchmarks; BERT; Information; Performance; LLM
Medical foundation large language models for comprehensive text analysis and beyond
Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F, Peng X, Huang J, Zhang J, Keloth V, Zhou X, Qian L, He H, Shung D, Ohno-Machado L, Wu Y, Xu H, Bian J. Medical foundation large language models for comprehensive text analysis and beyond. Npj Digital Medicine 2025, 8: 141. PMID: 40044845, PMCID: PMC11882967, DOI: 10.1038/s41746-025-01533-1.
Peer-Reviewed Original Research
Concepts: Text analysis tasks; Analysis tasks; Language model; Domain-specific knowledge; Zero-Shot; Human evaluation; Supervised setting; Task-specific instructions; Clinical data sources; Specialized medical knowledge; ChatGPT; Text analysis; Pretraining; Task; Data sources; Medical applications; Medical knowledge; Enhanced performance; Text; Performance
2025
Leveraging undecided cases in chart-reviewed phenotypes to enhance EHR-based association studies
Jian X, Zhang D, Yu Z, Xu H, Bian J, Wu Y, Tong J, Chen Y. Leveraging undecided cases in chart-reviewed phenotypes to enhance EHR-based association studies. Journal Of Biomedical Informatics 2025, 166: 104839. PMID: 40316004, DOI: 10.1016/j.jbi.2025.104839.
Peer-Reviewed Original Research
Concepts: Clinical Research Network; Association studies; Kaiser Permanente Washington; Breast cancer events; Manual chart review; Risk factor identification; Health records; Cancer events; EHR data; Phenotyping algorithms; Cohort data; Patient clinical outcomes; Research Network; Random sample; ADRD; Chart review; Factor identification; Cohort; Mean square error; Method improves efficiency; Outcomes; Simulation settings; Augmentation method; SBCE; Alzheimer's disease
Improving entity recognition using ensembles of deep learning and fine-tuned large language models: A case study on adverse event extraction from VAERS and social media
Li Y, Viswaroopan D, He W, Li J, Zuo X, Xu H, Tao C. Improving entity recognition using ensembles of deep learning and fine-tuned large language models: A case study on adverse event extraction from VAERS and social media. Journal Of Biomedical Informatics 2025, 163: 104789. PMID: 39923968, DOI: 10.1016/j.jbi.2025.104789.
Peer-Reviewed Original Research
Concepts: Traditional deep learning models; Deep learning models; Recurrent neural network; Learning models; Entity recognition; Language model; F1 score; Ensemble of deep learning; Advances of natural language processing; Effectiveness of ensemble methods; Micro-averaged F1; Bidirectional Encoder Representations; Extensive labeled data; Natural language processing; Fine-tuned models; Biomedical text mining; Feature representation; Encoder Representations; Event extraction; Entity types; Text data; Deep learning; Sequential data; GPT-2; Neural network
Evaluating the Bias, type I error and statistical power of the prior Knowledge-Guided integrated likelihood estimation (PIE) for bias reduction in EHR based association studies
Jing N, Lu Y, Tong J, Weaver J, Ryan P, Xu H, Chen Y. Evaluating the Bias, type I error and statistical power of the prior Knowledge-Guided integrated likelihood estimation (PIE) for bias reduction in EHR based association studies. Journal Of Biomedical Informatics 2025, 163: 104787. PMID: 39904407, PMCID: PMC12180398, DOI: 10.1016/j.jbi.2025.104787.
Peer-Reviewed Original Research
Concepts: Type I error; Integrated likelihood estimators; Electronic health records; Use-case analysis; Likelihood estimation; Low prevalence outcomes; Use-cases; Bias reduction; Naive method; Effect size; Synthetic data; Phenotyping algorithms; Estimation bias; Real-world scenarios; Statistical inference; Simulation study; Association effect sizes; Accurate prior information; Binary outcomes; Point estimates; Association estimates; Statistical power; Health records; Knowledge-guided; Outcome prevalence
Environment scan of generative AI infrastructure for clinical and translational science
Idnay B, Xu Z, Adams W, Adibuzzaman M, Anderson N, Bahroos N, Bell D, Bumgardner C, Campion T, Castro M, Cimino J, Cohen I, Dorr D, Elkin P, Fan J, Ferris T, Foran D, Hanauer D, Hogarth M, Huang K, Kalpathy-Cramer J, Kandpal M, Karnik N, Katoch A, Lai A, Lambert C, Li L, Lindsell C, Liu J, Lu Z, Luo Y, McGarvey P, Mendonca E, Mirhaji P, Murphy S, Osborne J, Paschalidis I, Harris P, Prior F, Shaheen N, Shara N, Sim I, Tachinardi U, Waitman L, Wright R, Zai A, Zheng K, Lee S, Malin B, Natarajan K, Price II W, Zhang R, Zhang Y, Xu H, Bian J, Weng C, Peng Y. Environment scan of generative AI infrastructure for clinical and translational science. Npj Health Systems 2025, 2: 4. PMID: 39872195, PMCID: PMC11762411, DOI: 10.1038/s44401-024-00009-w.
Peer-Reviewed Original Research
Concepts: Information technology staff; Data security; Generative AI; Clinician trust; Technology staff; AI bias; AI infrastructure; Environment scanning; National Institutes of Health; National Center for Advancing Translational Sciences; Comprehensive environmental scan; CTSA program; Infrastructure; Translational science; Security; Deployment; Institutes of Health; Network; Environmental scan; Coordinated approach
BiomedRAG: A retrieval augmented large language model for biomedicine
Li M, Kilicoglu H, Xu H, Zhang R. BiomedRAG: A retrieval augmented large language model for biomedicine. Journal Of Biomedical Informatics 2025, 162: 104769. PMID: 39814274, PMCID: PMC11837810, DOI: 10.1016/j.jbi.2024.104769.
Peer-Reviewed Original Research
2024
Towards Enhanced Topic Discovery on Semantic Maps for Biomedical Literature Exploration
Choi B, Ondov B, He H, Xu H. Towards Enhanced Topic Discovery on Semantic Maps for Biomedical Literature Exploration. 2024, 00: 25-27. DOI: 10.1109/vahc65315.2024.00015.
Peer-Reviewed Original Research
Concepts: Semantic map; Hierarchical topic model; TF-IDF method; Centroid-based method; Hierarchical topics; Topic discovery; Growth of biomedical research; Label generation; Topic tree; Topic models; Overwhelming volume; Novel method; Hierarchical clustering; Public distribution; Literature exploration; Semantics; Maps; Enhanced visualization; HDBSCAN; Topics; Labeling; Method; Representation; Visualization; Volume of literature
SEETrials: Leveraging large language models for safety and efficacy extraction in oncology clinical trials
Lee K, Paek H, Huang L, Hilton C, Datta S, Higashi J, Ofoegbu N, Wang J, Rubinstein S, Cowan A, Kwok M, Warner J, Xu H, Wang X. SEETrials: Leveraging large language models for safety and efficacy extraction in oncology clinical trials. Informatics In Medicine Unlocked 2024, 50: 101589. PMID: 39493413, PMCID: PMC11530223, DOI: 10.1016/j.imu.2024.101589.
Peer-Reviewed Original Research
Concepts: Antibody-drug conjugates; Overall response rate; Multiple myeloma; F1 score; CAR-T; Complete response; Bispecific antibodies; Comparative performance analysis; Clinical trial study; Clinical trial outcomes; Language model; Accurate data extraction; Therapy subgroup; Fine granularity; Oncology clinical trials; Adverse events; Clinical decision-making; Performance analysis; Clinical trials; Innovative therapies; Diverse therapies; Clinical trial abstracts; Cancer domain; Data elements; Therapy
Improving tabular data extraction in scanned laboratory reports using deep learning models
Li Y, Wei Q, Chen X, Li J, Tao C, Xu H. Improving tabular data extraction in scanned laboratory reports using deep learning models. Journal Of Biomedical Informatics 2024, 159: 104735. PMID: 39393477, DOI: 10.1016/j.jbi.2024.104735.
Peer-Reviewed Original Research
Concepts: Tree edit distance; Optical character recognition; Table recognition; Deep learning models; Average recall; Average precision; State-of-the-art deep learning models; Learning models; Region-of-interest detection; State-of-the-art; Character recognition; Detection evaluation; Tree editing; Tabular data; Impressive results; Lab test results; Laboratory test reports; Clinical documentation; Recognition; Laboratory reports; Healthcare organizations; Clinical data analysis; Decision making; Clinical decision making; Test reports
News
- May 14, 2025 (Source: Yale Medicine Magazine)
Chatbot Revolution: From Me-LLaMA to GutGPT, YSM researchers leverage LLMs
- April 25, 2025
Yale BIDS Enhances Research with Comprehensive Data and Service Through YBIC
- March 18, 2025
Connecticut Academy of Science and Engineering Elects 12 From YSM
- February 28, 2025
Yale Researchers Use Large Language Models to Detect Gastrointestinal Bleeding