In Depth

What Does Natural Language Processing Mean for Biomedicine?

By Kei-Hoi Cheung, PhD, Qingyu Chen, PhD, Hua Xu, PhD, Vipina K. Keloth, PhD & Andrew Taylor, MD, MHS

October 02, 2023

Several researchers at Biomedical Informatics & Data Science are interested in exploring natural language processing (NLP) in biomedicine. In this article, four of these scientists explain what NLP means for their research and share perspectives on the opportunities of this fast-growing field.

Kei-Hoi Cheung: Harnessing the Power of Big Healthcare Data

Dr. Kei-Hoi Cheung is a renowned researcher and educator, and Professor of Biomedical Informatics & Data Science at Yale. Dr. Cheung has also co-edited two books about the “semantic web” (a framework to make Internet data machine-readable). Recently, Dr. Cheung has begun NLP projects on annotating, extracting, and retrieving information from clinical text in the Veterans Administration’s electronic medical records.

In the digital world of healthcare, Electronic Health Records (EHRs) have given rise to big patient data, including everything from patients’ diagnoses and social determinants of health, to lab test results, drug prescriptions, and medical histories. However, the bulk of patient data is not structured, but rather exists as clinical notes in free text form. Because traditional healthcare analytics have relied predominantly on structured data, a wealth of clinical data remains buried and unused as free text. We call this buried data “dark data”. Mining large amounts of clinical notes to find “dark data” is a major challenge in data science. Excitingly, natural language processing (NLP) has emerged as a tool that can help us overcome this challenge. NLP is a machine learning technology that automatically interprets, manipulates, and comprehends human language.

NLP and ontologies can help researchers access the 'iceberg' of dark data.

Thus far, I have embarked on clinical NLP projects that extract information from clinical text in the Veterans Administration (VA) EHRs. One of these projects extracts different types of measurements (for example, pulmonary function tests) that are recorded as text in clinical notes. We have developed an efficient Structured Query Language (SQL) algorithm that exploits a full-text index to search for specific keywords that appear near measurement values of interest. The algorithm generates a collection of snippets (small pieces of text) containing the measurement values associated with each keyword. We then use a search pattern called a regular expression to find and extract the measurement value from each snippet. This keyword-snippet-based approach can help expedite annotation and curation by domain experts, who can focus their effort on snippets instead of whole clinical notes. The annotated and curated snippets can then be used to train machine learning models for similar information extraction needs.
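The keyword-snippet idea described above can be illustrated with a small Python sketch. This is not the VA implementation, which runs in SQL against a full-text index; here a plain in-memory keyword search stands in for the index, and the keyword list, window size, and value pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword list and window size; the project's real settings are not given.
KEYWORDS = ("FEV1", "FVC")
WINDOW = 30  # characters of text kept after each keyword hit

# A number (optionally with decimals) followed by a unit: liters or percent.
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(L\b|%)")


def find_snippets(note, keywords=KEYWORDS, window=WINDOW):
    """Return (keyword, snippet) pairs, one per keyword occurrence in the note."""
    snippets = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        for m in re.finditer(pattern, note, re.IGNORECASE):
            # The snippet starts at the keyword and runs a fixed window past it.
            snippets.append((kw, note[m.start():m.end() + window]))
    return snippets


def extract_value(keyword, snippet):
    """Return (value, unit) for the first measurement after the keyword, or None."""
    # Start searching after the keyword so the digit in "FEV1" is not misread.
    m = VALUE_RE.search(snippet, len(keyword))
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


def extract_measurements(note):
    """Combine both steps: collect snippets, then pull a value out of each."""
    results = []
    for kw, snippet in find_snippets(note):
        found = extract_value(kw, snippet)
        if found is not None:
            value, unit = found
            results.append({"keyword": kw, "value": value,
                            "unit": unit, "snippet": snippet})
    return results


if __name__ == "__main__":
    note = "PFTs today: FEV1 2.45 L (78% predicted), FVC 3.10 L. Patient denies dyspnea."
    for row in extract_measurements(note):
        print(row["keyword"], row["value"], row["unit"], "|", row["snippet"])
```

Keeping the snippet alongside the extracted value mirrors the point made above: a domain expert can check the short snippet instead of rereading the whole note.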

I am also interested in applying biomedical ontologies to clinical NLP. Ontologies can play an important role in building language models, as they can be used to create customized Artificial Intelligence (AI) applications for specific clinical contexts. For example, the word “cold” could refer to a cold temperature or a common viral infection that causes a runny nose and sore throat. An ontology can provide this context, enabling the language model to understand which meaning is correct in each situation.
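To make the “cold” example concrete, here is a toy Python sketch of ontology-guided disambiguation. The two-entry ontology, its identifiers (`TOY:0001`, `TOY:0002`), and the context-word lists are invented for illustration; a real biomedical ontology is far richer, and a language model would use it in more sophisticated ways than simple word overlap.

```python
import re

# Toy ontology: each sense of an ambiguous term lists words that tend to
# appear near it. All identifiers and word lists here are hypothetical.
TOY_ONTOLOGY = {
    "cold": [
        {
            "id": "TOY:0001",
            "label": "common cold (viral infection)",
            "context": {"runny", "nose", "sore", "throat", "cough", "congestion"},
        },
        {
            "id": "TOY:0002",
            "label": "cold temperature",
            "context": {"temperature", "weather", "exposure", "ice", "hypothermia"},
        },
    ]
}


def disambiguate(term, sentence, ontology=TOY_ONTOLOGY):
    """Pick the sense whose context words overlap most with the sentence.

    Returns the sense label, or None if the term is unknown or no sense
    shares any context word with the sentence.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    senses = ontology.get(term.lower(), [])
    if not senses:
        return None
    best = max(senses, key=lambda sense: len(sense["context"] & words))
    if not best["context"] & words:
        return None
    return best["label"]


if __name__ == "__main__":
    print(disambiguate("cold", "Patient reports a cold with runny nose and sore throat."))
    print(disambiguate("cold", "Hands were cold after prolonged exposure to icy weather."))
```

Returning `None` when no context word matches is a deliberate choice in this sketch: it flags the mention for human review instead of guessing.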

Clinical NLP and ontologies can enable researchers to harness the power of big healthcare data, including clinical notes, to gain important insights and advances in biomedical research.

Qingyu Chen: Navigating the Landscape of Biomedical Literature

Dr. Qingyu Chen is an Associate Research Scientist of Biomedical Informatics & Data Science. His research focuses on biomedical text mining, medical image analysis, and AI-assisted healthcare applications. Dr. Chen has twice received the NIH Fellows Award for Research Excellence, has been listed in the Top 50 Talent in AI in cross-disciplines by Baidu Scholar, and has received multiple teaching and mentorship awards.

A primary research focus for me is biomedical natural language processing (BioNLP). This field aims to automate information extraction and knowledge discovery from the vast and complex landscape of biomedical literature. The challenge in this domain lies in the sheer volume of biomedical literature and the unique hurdles it presents for curation, interpretation, and knowledge extraction. For instance, PubMed alone grows by about 5,000 articles every day and now holds over 36 million. In addition to volume, biomedical literature also poses domain-specific challenges. A single entity like Long COVID can be described using 763 different terms. To overcome these challenges, BioNLP research plays a crucial role in assisting with manual curation, interpretation, and knowledge discovery.

My research in BioNLP focuses on three main areas. First, I have conducted extensive research in developing foundational BioNLP models. Representative language models include BioWordVec, BioSentVec, BioConceptVec, and Bioformer. Second, I have introduced innovative BioNLP methods for data curation (for instance, triaging relevant literature for annotation), information retrieval (for instance, searching for similar findings at PubMed scale), and information extraction (for example, extracting entities of interest and normalizing them using ontologies). Third, I have been doing research in downstream BioNLP applications. My colleagues and I have developed LitCovid (a central hub for tracking scientific literature about COVID-19), LitSense (a sentence-level search tool for biomedical literature), and LitSuggest (a machine learning tool that finds, recommends, and curates biomedical publications), which have been accessed millions of times by biomedical researchers and healthcare professionals.
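For readers curious how sentence-level similarity search works in principle, the following Python sketch ranks sentences by cosine similarity between vectors. It is only a toy: the `embed` function uses simple word counts as a stand-in for learned sentence embeddings such as those produced by BioSentVec, and none of the function names or example sentences come from the tools mentioned above.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: a bag-of-words count vector (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, corpus, top_k=1):
    """Return the top_k corpus sentences ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    corpus = [
        "Fatigue and brain fog persist months after COVID-19 infection.",
        "Statins reduce cardiovascular risk in older adults.",
        "Regular exercise improves sleep quality.",
    ]
    print(most_similar("persistent fatigue after COVID-19 infection", corpus))
```

Word counts only match identical tokens, so this toy misses synonyms and paraphrases. Learned embeddings are meant to place differently worded but related sentences close together, which matters when one concept can be described with hundreds of different terms.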

My current research involves integrating biomedical text with other data modalities to facilitate multi-modal analysis for AI-assisted disease diagnosis. I have also expanded my research into Large Language Models (LLMs) tailored to the biomedical domain. I have secured a K99 grant for this endeavor, and I am actively seeking talented individuals to join me on this exciting journey. If you are interested in exploring related opportunities, please do not hesitate to reach out.

Andrew Taylor: Applying NLP to Emergency Medicine

Dr. Andrew Taylor, MD, MHS, is Assistant Professor of Emergency Medicine and Director of Clinical Informatics and Analytics. In this role, he oversees numerous IT efforts in the emergency department (ED), including building dashboards with sophisticated models (risk adjustment, etc.), developing ED clinical decision support tools, and serving as the liaison to hospital IT leadership. Dr. Taylor's research focuses on applying data science to various aspects of emergency care.

My team and I are currently exploring various applications of NLP in education, clinical practice, and population health. One exciting recent development is our publication comparing the performance of ChatGPT with that of medical students on the United States Medical Licensing Examination (USMLE). This research was featured on MedPage Today and other news websites. We are also actively studying how responses vary with different prompt engineering approaches, as well as the automatic extraction of clinical decision rule scores.

Comparing USMLE performance between medical students and Large Language Models

Additionally, in collaboration with Dr. Karen Wang, we have developed methods for identifying justice-related concepts in emergency notes. Our team is also focused on identifying concepts related to patient restraints, social determinants of health, and signs and symptoms of urinary tract infections. Finally, we are working to extend MedCAT (an NLP tool that can extract data from EHRs and link it to biomedical ontologies) to the emergency medicine domain.

Hua Xu: Addressing Real-World Challenges through NLP and LLM

Dr. Hua Xu is a widely recognized researcher in clinical natural language processing (NLP). He has developed novel algorithms for important clinical NLP tasks, such as “entity recognition” (identifying essential information in a text) and “relation extraction” (extracting semantic relationships in a written text). Dr. Xu has also led multiple national and international initiatives to apply NLP technologies to diverse clinical and translational studies, accelerating clinical evidence generation using electronic health record (EHR) data. Recently, he has used NLP to harmonize metadata of biomedical digital objects (indexing millions of biomedical datasets to make them findable), with the goal of promoting the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles of data reuse.

Can you imagine how arduous it would be if you had to sift through countless clinical notes, case studies and literature to uncover the information that you need? This is one of the reasons our lab is dedicated to harnessing the power of natural language processing (NLP) to develop cutting-edge algorithms to advance biomedical and clinical research and improve patient care. NLP algorithms allow us to extract all kinds of data: from medications and medical problems, to treatments, tests, adverse drug reactions or even smoking status. With years of research experience in designing innovative technologies and algorithms, developing and productionizing software tools to support information extraction, and applying these technologies to provide solutions to a wide range of clinical and biomedical problems, our lab has been established as a leading group in the field of clinical NLP.

However, bringing these models to full-scale implementation and making them available to the research community is the true test. In the last few years, our team developed CLAMP (Clinical Language Annotation, Modeling & Processing Toolkit), a comprehensive clinical NLP tool used by over 650 organizations to extract key information from biomedical textual data. From its humble origins as a student project, this endeavor has flourished into a product that has found widespread application. The significance of these tools is evident when they are deployed to address real-world challenges, such as studying Alzheimer's disease and diverse forms of cancer and, in precision medicine, exploring the influence of social determinants of health. One emphasis of our NLP work has been collaborating with clinicians and domain experts to apply these tools across various medical specialties, thus unlocking further insights and augmenting our understanding of complex healthcare dynamics.

As Large Language Models (LLMs) become increasingly popular and powerful, my lab has its eyes set on the vast landscape of LLM research and its diverse applications. We are currently evaluating the performance of models such as GPT-4 and LLaMA on several clinical and biomedical tasks. By analyzing their strengths and weaknesses, we plan to focus on crafting systems that can surpass the limitations of current models and achieve remarkable advancements in healthcare applications. Whether you possess a passion for designing innovative systems, a knack for software development, or an eagerness to apply the transformative potential of NLP to extract profound insights from clinical and biomedical data, we hope you will join our lab to unleash your skills and contribute to groundbreaking advancements in biomedical informatics and data science.
