Hua Xu, PhD
Appointments
Additional Titles
Vice Chair for Research and Development, Department of Biomedical Informatics and Data Science
Associate Dean for Biomedical Informatics, Yale School of Medicine
Director, CBB MS Program, Biomedical Informatics & Data Science
Contact Info
Biomedical Informatics & Data Science
101 College St
New Haven, Connecticut 06510
United States
About
Titles
Robert T. McCluskey Professor of Biomedical Informatics and Data Science
Vice Chair for Research and Development, Department of Biomedical Informatics and Data Science; Associate Dean for Biomedical Informatics, Yale School of Medicine; Director, CBB MS Program, Biomedical Informatics & Data Science
Biography
Dr. Hua Xu is the Robert T. McCluskey Professor and Vice Chair for Research and Development in the Department of Biomedical Informatics and Data Science at Yale School of Medicine (YSM). He also serves as Associate Dean for Biomedical Informatics at YSM. He received his Ph.D. in Biomedical Informatics from Columbia University. His primary research interests include biomedical natural language processing (NLP), large language models (LLMs), and AI agents, as well as their applications in clinical practice and biomedical research. His research is funded by multiple agencies, including NLM, NCI, NIGMS, NIA, AHA, and CPRIT, and methods and tools developed in his lab have been widely used to support diverse biomedical applications. Dr. Xu is a fellow of both the American College of Medical Informatics (ACMI) and the International Academy of Health Sciences Informatics (IAHSI). More information about Dr. Xu's lab is available on the lab website.
Appointments
Biomedical Informatics & Data Science
Professor (Primary)
Other Departments & Organizations
Education & Training
- PhD, Columbia University, Biomedical Informatics
- MS, New Jersey Institute of Technology, Computer Science
- BS, Nanjing University, Biochemistry
Research
Overview
Medical Research Interests
ORCID: 0000-0002-5274-4672
Lab Website: Clinical NLP Lab
Research at a Glance
Yale Co-Authors
- Lucila Ohno-Machado, MD, MBA, PhD
- Vipina K. Keloth, PhD
- Qingyu Chen, PhD
- Kalpana Raja, PhD, MRSB, CSci
- Rohan Khera, MD, MS
- Harlan Krumholz, MD, SM
Research Interests
- Natural Language Processing
Publications
Featured Publications
Benchmarking large language models for biomedical natural language processing applications and recommendations
Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer M, Ai X, Lai P, Wang Z, Keloth V, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng W, Adelman R, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nature Communications 2025, 16: 3280. PMID: 40188094, PMCID: PMC11972378, DOI: 10.1038/s41467-025-56989-2. Peer-Reviewed Original Research.
Concepts: Language model; Natural language processing applications; Biomedical natural language processing; Medical question answering; Language processing applications; Natural language processing; Growth of biomedical literature; Missing information; Few-shot; Question Answering; Zero-Shot; Knowledge curation; Language processing; Processing applications; BioNLP; BART model; Performance gap; Biomedical literature; General domain; Task; Benchmarks; BERT; Information; Performance; LLM
Medical foundation large language models for comprehensive text analysis and beyond
Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F, Peng X, Huang J, Zhang J, Keloth V, Zhou X, Qian L, He H, Shung D, Ohno-Machado L, Wu Y, Xu H, Bian J. Medical foundation large language models for comprehensive text analysis and beyond. Npj Digital Medicine 2025, 8: 141. PMID: 40044845, PMCID: PMC11882967, DOI: 10.1038/s41746-025-01533-1. Peer-Reviewed Original Research.
Concepts: Text analysis tasks; Analysis tasks; Language model; Domain-specific knowledge; Zero-Shot; Human evaluation; Supervised setting; Task-specific instructions; Clinical data sources; Specialized medical knowledge; ChatGPT; Text analysis; Pretraining; Task; Data sources; Medical applications; Medical knowledge; Enhanced performance; Text; Performance
Improving large language models for clinical named entity recognition via prompt engineering
Hu Y, Chen Q, Du J, Peng X, Keloth V, Zuo X, Zhou Y, Li Z, Jiang X, Lu Z, Roberts K, Xu H. Improving large language models for clinical named entity recognition via prompt engineering. Journal Of The American Medical Informatics Association 2024, 31: 1812-1820. PMID: 38281112, PMCID: PMC11339492, DOI: 10.1093/jamia/ocad259. Peer-Reviewed Original Research.
Concepts: Clinical NER tasks; NER task; Task-specific prompts; Entity recognition; Language model; Training samples; State-of-the-art models; Few-shot learning; State-of-the-art; Minimal training data; Task-specific knowledge; F1-score; Annotated samples; Concept extraction; Model performance; Annotated datasets; Training data; F1 score; Task description; Format specifications; Complex clinical data; Optimal performance; Task; Evaluation schema; GPT model
BiomedRAG: A retrieval augmented large language model for biomedicine
Li M, Kilicoglu H, Xu H, Zhang R. BiomedRAG: A retrieval augmented large language model for biomedicine. Journal Of Biomedical Informatics 2025, 162: 104769. PMID: 39814274, PMCID: PMC11837810, DOI: 10.1016/j.jbi.2024.104769. Peer-Reviewed Original Research.
2025
TopicForest: embedding-driven hierarchical clustering and labeling for biomedical literature
Chang C, Ondov B, Choi B, Peng X, He H, Xu H. TopicForest: embedding-driven hierarchical clustering and labeling for biomedical literature. Journal Of Biomedical Informatics 2025, 172: 104958. PMID: 41242669, DOI: 10.1016/j.jbi.2025.104958. Peer-Reviewed Original Research.
Concepts: Adjusted Mutual Information; Biomedical abstracts; Biomedical literature; Expansion of biomedical literature; Hierarchical topic model; Hierarchical clustering; Hierarchical clustering technique; Topic hierarchy; Labeling diversity; Topic discovery; Topic summarization; Clustering quality; Clustering performance; Manifold learning; Embedding model; Topic models; Label quality; Labeling framework; Semantic space; Clustering technique; Multi-scale exploration; Mutual information; Coherent labeling; Clustering method; Dimension reduction
Extracting language information from clinical notes using large language models
Qian L, Hong N, Zhou Y, Xie Q, Weng R, Chairuengjitjaras P, Du X, Lian J, Marshall G, Blackley S, Novoa-Laurentiev J, Quiroz Y, Kim T, Adams N, Dossett M, Zhou L, Xu H. Extracting language information from clinical notes using large language models. International Journal Of Medical Informatics 2025, 205: 106116. PMID: 40992205, PMCID: PMC12490899, DOI: 10.1016/j.ijmedinf.2025.106116. Peer-Reviewed Original Research.
Concepts: Language information; Language model; Electronic health records; Field of electronic health records; Yale-New Haven Hospital; NER framework; Zero-Shot; Entity recognition; Information extraction; MIMIC dataset; F1 score; Clinical narratives; Patient-provider communication; Clinical notes; Patient-centered care; MIMIC-III; Equitable healthcare delivery; Service allocation; Superior performance; Cross-site validation; Automated extraction; Open-source model; Health records; BERT; Patients' language proficiency
Scientific Writing in the Era of Large Language Models: A Computational Analysis of AI- Versus Human-Created Content
Khera R, Pedroso A, Keloth V, Xu H, Silva G, Schwamm L. Scientific Writing in the Era of Large Language Models: A Computational Analysis of AI- Versus Human-Created Content. Stroke 2025, 56: 3078-3083. PMID: 40814778, DOI: 10.1161/strokeaha.125.051913. Peer-Reviewed Original Research.
Concepts: Language model; Artificial intelligence; AI-generated; Linguistic features; Detection tools; AI-generated content; Human-written text; Language perplexity; Human experts; Performance of experts; Linguistic differences; Scientific texts; Grade level; Word count; Essay; Language; Scientific communication; Scientific writing; Computer synthesis; Higher grade levels; Text; Scientific content; Readability scores; Perplexity; Flesch-Kincaid
Accuracy of Large Language Models in Generating Rare Disease Differential Diagnosis Using Key Clinical Features.
Shyr C, Tinker R, Harris P, Cheng A, Byram K, Bastarache L, Peterson J, Hamid R, Xu H, Cassini T. Accuracy of Large Language Models in Generating Rare Disease Differential Diagnosis Using Key Clinical Features. Studies In Health Technology And Informatics 2025, 329: 1054-1058. PMID: 40776018, DOI: 10.3233/shti251000. Peer-Reviewed Original Research.
Changes in Cardiovascular Risk Factors and Health Care Expenditures Among Patients Prescribed Semaglutide
Lu Y, Liu Y, Totojani T, Kim C, Khera R, Xu H, Brush J, Krumholz H, Abaluck J. Changes in Cardiovascular Risk Factors and Health Care Expenditures Among Patients Prescribed Semaglutide. JAMA Network Open 2025, 8: e2526013. PMID: 40779264, PMCID: PMC12334959, DOI: 10.1001/jamanetworkopen.2025.26013. Peer-Reviewed Original Research.
Concepts: Health care expenditures; Cardiovascular risk factors; Care expenditures; Cohort study; Risk factors; Yale New Haven Health System; Cohort study of adults; Type 2 diabetes status; Long-term impact; Study of adults; Health system; Retrospective cohort study; Blood pressure; Hemoglobin A1c reduction; Main Outcomes; Total cholesterol; Sentara Healthcare; Inpatient stay; Secondary outcomes; Glucagon-like peptide-1 receptor agonists; Primary outcome; Health; Peptide-1 receptor agonists; Associated with clinical outcomes; Associated with reductions
Large Language Models for Rare Disease Diagnosis at the Undiagnosed Diseases Network
Shyr C, Cassini T, Tinker R, Byram K, Embí P, Bastarache L, Peterson J, Xu H, Hamid R. Large Language Models for Rare Disease Diagnosis at the Undiagnosed Diseases Network. JAMA Network Open 2025, 8: e2528538. PMID: 40844783, PMCID: PMC12374213, DOI: 10.1001/jamanetworkopen.2025.28538. Peer-Reviewed Original Research.
Academic Achievements & Community Involvement
News
- March 04, 2026
AI in Medicine: Collaborating on Challenges and Opportunities
- September 16, 2025 (Source: NIH)
Yale Team Recognized in NIH $1 Million Data Sharing Challenge
- July 01, 2025
Hua Xu, PhD, Receives NIH Supplement to Advance Mental Health Research
- April 25, 2025
Yale BIDS Enhances Research with Comprehensive Data and Service Through YBIC
Get In Touch
Contacts
Biomedical Informatics & Data Science
101 College St
New Haven, Connecticut 06510
United States