
Chatbot Revolution

Yale Medicine Magazine, Spring 2025 (issue 174): AI for Humanity in Medicine

by Isabella Backman


From Me-LLaMA to GutGPT, YSM researchers leverage LLMs.

You've heard of ChatGPT. After its launch in 2022, it quickly became one of the fastest-growing applications (better known as “apps”) in history, amassing 100 million active users within its first two months. For many, the enormously popular chatbot from OpenAI was an eye-opener to the massive potential of large language models (LLMs)—a type of artificial intelligence (AI) designed to understand and generate human language.

Now, LLMs have become a major subject of interest in medical research. Scientists at Yale School of Medicine (YSM) are exploring ways to train specialized chatbots to act like a clinician’s personal AI assistant. As the thinking goes, LLMs could one day assist doctors with answering clinical questions quickly, diagnosing diseases, interpreting test results, selecting appropriate therapies, and more. By reducing physicians’ heavy workload, these chatbots could revolutionize the ways in which doctors deliver care and improve patients’ experiences.

“One of the biggest breakthroughs of AI in the past three years has been these large language models,” says Lucila Ohno-Machado, MD, PhD, MBA, Waldemar von Zedtwitz Professor of Medicine and Biomedical Informatics and Data Science (BIDS), deputy dean for biomedical informatics, and chair of BIDS. “With ChatGPT and other LLMs, everyone now has had a chance to see how these generative AI models work in practice.”

New LLMs receive specialized biomedical training

ChatGPT made headlines in 2023 after it identified the cause of a young boy’s mysterious chronic pain. His mother had taken him to 17 doctors over three years, but none could figure out the cause of her son’s suffering. Out of frustration, she then turned to ChatGPT. After entering as much information as she could on his condition, she finally received the long-awaited answer—tethered spinal cord syndrome. She made an appointment with a neurosurgeon who confirmed the chatbot’s diagnosis. The boy finally received surgery to treat his chronic pain.

Indeed, chatbots show potential in assisting doctors with diagnosing complex medical cases. One 2023 study in JAMA found that OpenAI’s chatbot GPT-4 accurately identified the final diagnoses of challenging medical cases 39% of the time and included the correct diagnosis in its list of possible conditions 64% of the time. While promising, the chatbot’s lack of specialized training still leaves much room for improvement.

Me-LLaMA is a novel family of LLMs introduced by YSM researchers. It is similar to its cousins ChatGPT and GPT-4, but those models are closed-source—meaning they aren’t easily accessible to or customizable by researchers.

To address this issue, Hua Xu, PhD, Robert T. McCluskey Professor of Biomedical Informatics and Data Science and assistant dean for biomedical informatics, and his team are developing this new family of LLMs, collectively known as Me-LLaMA—among the first and largest open-source models trained on extensive biomedical and clinical data. “Me-LLaMA is an open-source medical foundation model that we are continuously training on large amounts of biomedical text and releasing to the community,” Xu says. His team used over 129 billion tokens—small pieces of text, like words or parts of words, that the model processes—to train Me-LLaMA. “We are doing both pre-training and fine-tuning to improve its performance on many biomedical applications.”
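For readers curious about what a “token” actually looks like, here is a minimal sketch using the openly available GPT-2 tokenizer from the Hugging Face transformers library as a stand-in; Me-LLaMA’s own tokenizer differs, but it splits text on the same subword principle.

```python
# A minimal sketch of what "tokens" are. The GPT-2 tokenizer is a stand-in
# here; Me-LLaMA's own tokenizer differs but splits text the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Patient presents with melena and a hemoglobin of 7.2 g/dL."
tokens = tokenizer.tokenize(text)

print(tokens)       # subword pieces; rare clinical words split into several tokens
print(len(tokens))  # counts like "129 billion tokens" are sums of pieces like these
```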

Xu’s team is training these models on massive amounts of data, including millions of biomedical articles from the PubMed database, clinical notes from anonymized databases, clinical guidelines, and more. The researchers are also studying how well the models perform various tasks. For example, users can ask the chatbot questions about specific publications or ask it to extract relevant information about a clinical trial.

The researchers are also comparing the performance of Me-LLaMA and other LLMs using publicly available datasets that test these models in different areas, such as answering medical questions. So far, they are finding that Me-LLaMA outperforms such other existing open medical LLMs as Meditron-70B and such commercial models as ChatGPT and GPT-4 across these kinds of tasks.
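In outline, such head-to-head comparisons are straightforward: pose the same benchmark questions to every model and score the answers. The sketch below is a hypothetical harness—the ask_model function and the toy question, written in the style of public medical QA datasets, are placeholders, not the team’s actual evaluation code.

```python
# A schematic benchmark harness: ask each model the same multiple-choice
# medical questions and score exact-match accuracy. `ask_model` is a
# hypothetical placeholder, not a real API for Me-LLaMA or GPT-4.

def ask_model(model_name: str, question: str, choices: list[str]) -> str:
    # A real harness would prompt the named model and parse its reply.
    return choices[0]  # dummy behavior so the sketch runs

benchmark = [  # toy item in the style of public datasets such as MedQA
    {
        "question": "What is the first-line treatment for anaphylaxis?",
        "choices": ["Epinephrine", "Diphenhydramine", "Prednisone"],
        "answer": "Epinephrine",
    },
]

def accuracy(model_name: str) -> float:
    correct = sum(
        ask_model(model_name, q["question"], q["choices"]) == q["answer"]
        for q in benchmark
    )
    return correct / len(benchmark)

print(accuracy("me-llama"))  # 1.0 with the dummy model above
```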

“We are showing that large language models have great potential as an AI assistant that helps with clinical diagnostic reasoning, accelerating clinical documentation, and making clinical work more efficient while improving patient care,” says Qianqian Xie, PhD, associate research scientist in Xu’s lab. Xie is currently exploring Me-LLaMA’s ability not only to come up with potential diagnoses when given a summary of a particular case but also to explain its reasoning for each one.

Updating Me-LLaMA requires significant computational resources. Fortunately, says Xu, Yale is dedicated to supporting the development of a robust graphics processing unit (GPU) infrastructure. Recently, the Office of the Provost announced it will invest over $150 million in AI development.

From ChatGPT to GutGPT

Other chatbots are undergoing an even greater degree of specialization. At the Yale Center for Healthcare Simulation, which allows members of YSM and the Yale New Haven Health System to practice using various clinical methodologies, researchers are evaluating the usefulness of an LLM designed to support clinicians treating patients with gastrointestinal (GI) bleeding.

Meet GutGPT, an LLM trained on the latest clinical practice guidelines on GI bleeding—the most common GI condition that leads to hospitalization in the United States. The chatbot is designed to help clinicians predict the severity of a patient’s bleed and offer evidence-based treatment recommendations.

GutGPT’s development is being led by Dennis L. Shung, MD, PhD, MHS, assistant professor of medicine (digestive diseases). The first thing a gastroenterologist needs to do when a patient enters the emergency department with acute gastrointestinal bleeding is to assess whether the individual needs inpatient treatment. For about half of these patients, he says, hospitalization is not necessary. “They just need to go home and make an appointment for an outpatient endoscopy.”

Shung wondered whether AI could help clinicians distinguish low-risk patients who could safely go home from high-risk patients who need hospitalization. Using electronic health records from patients seen in the Yale New Haven Health System for GI bleeding, his team created a machine learning model that produces a clinical risk score to identify very-low-risk patients who do not require a hospital-based intervention. The model allows providers to obtain a real-time risk assessment while patients are still in the emergency department, saving valuable time.
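The published model is more sophisticated, but the core idea can be sketched with a simple logistic regression over a few EHR-style features. Everything below—the features, the numbers, the labels—is synthetic illustration, not the Yale model or its coefficients.

```python
# A hedged sketch of a triage-style risk model: logistic regression over a
# few EHR-style features. All data here is synthetic illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: hemoglobin (g/dL), systolic BP (mmHg), heart rate (bpm)
X = np.array([
    [14.1, 128, 72],   # stable vital signs, normal hemoglobin
    [7.4,   92, 118],  # anemic, hypotensive, tachycardic
    [13.2, 121, 80],
    [8.9,   98, 105],
])
y = np.array([0, 1, 0, 1])  # 1 = needed hospital-based intervention

model = LogisticRegression().fit(X, y)

new_patient = np.array([[9.5, 105, 98]])
risk = model.predict_proba(new_patient)[0, 1]
print(f"Predicted intervention risk: {risk:.0%}")
```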

Shung’s team then integrated the risk model into GutGPT. “Now, providers can not only ask, ‘What is the risk?’ but also things like ‘Why is the risk so high?’ or ‘What are the factors driving it?’” Shung says. Beyond risk assessment, users can ask the chatbot a range of questions about how best to treat patients with gastrointestinal bleeding and receive evidence-based responses drawn from national clinical guidelines.
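One plausible way to wire a risk model into a chatbot—sketched below with a hypothetical llm_complete stand-in—is to fold the model’s output and its inputs into the prompt, so the LLM can ground its explanation in them. GutGPT’s actual integration may work differently.

```python
# A sketch of folding a risk model's output into a chatbot prompt so the
# LLM can answer follow-ups like "Why is the risk so high?".
# `llm_complete` is a hypothetical stand-in for a call to the underlying LLM.

def llm_complete(prompt: str) -> str:
    return "(model response)"  # placeholder so the sketch runs

def build_prompt(risk: float, features: dict[str, float], question: str) -> str:
    feature_lines = "\n".join(f"- {name}: {value}" for name, value in features.items())
    return (
        f"A triage model estimates this patient's risk of needing "
        f"hospital-based intervention at {risk:.0%}, based on:\n"
        f"{feature_lines}\n\n"
        f"Clinician question: {question}\n"
        f"Answer using the risk inputs above and current GI bleeding guidelines."
    )

prompt = build_prompt(
    0.78,
    {"hemoglobin (g/dL)": 7.4, "systolic BP (mmHg)": 92},
    "Why is the risk so high?",
)
print(llm_complete(prompt))
```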

To better understand how physicians, residents, and medical students use these tools, Shung’s team conducted a randomized controlled trial at the Yale Center for Healthcare Simulation measuring the efficacy of GutGPT versus the risk model alone. By observing how users engage with GutGPT, the researchers are learning how to fine-tune the chatbot and optimize its usefulness. Shung’s team has published results of its work on the machine learning model in Gastroenterology, and the preliminary qualitative results of GutGPT in the Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems.

Because risk changes over time—a bleed may worsen after a patient enters the hospital—Shung’s team has also explored novel generative AI algorithms to better understand an individual’s risk trajectory. “We’re really excited about not only identifying the risk initially, but being able to track a patient over time so we can identify those who need more attention versus those who are stable and should go home,” Shung says. In conjunction with a group led by Yoshua Bengio, PhD, at Mila-Quebec Artificial Intelligence Institute, the team has published this work in Advances in Neural Information Processing Systems.

Beyond text: Researchers teach LLMs to process medical imagery

LLMs have grown tremendously in their ability to understand, process, and generate text. But LLMs with the ability to “see” or “hear” are also in the works; researchers are training these models to handle input beyond text, including images and audio.

Over 3 million adults in the United States aged 40 and older suffer from severe visual impairment—and experts predict that this number will double by 2030. Early detection of such eye diseases as diabetic retinopathy, cataracts, and glaucoma can help prevent vision loss or blindness.

Qingyu Chen, PhD, assistant professor of biomedical informatics and data science, and his team are developing and studying the potential of multimodal LLMs that process both text and medical images to help doctors analyze eye images, diagnose ophthalmic diseases, and formulate treatment plans. The current version includes an ophthalmology-specific foundational language model called Language Enhanced Model for Eye (LEME). Evaluated on 10 downstream tasks in ophthalmology, including real-world validation on patient records for disease diagnosis and management, LEME has outperformed eight baseline LLMs from the commercial, general, and medical domains.

“A clinician can, for example, directly input medical images from the patients, and then the AI system will make predictions [about the patient’s condition]. It can also highlight features of the image to aid the decision-making,” says Chen. “Based on these insights, the doctor can make the final decision [on disease management].”
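As a rough illustration of that image-in, prediction-out workflow, the sketch below uses a general-domain open visual question answering model from Hugging Face as a stand-in; LEME and Chen’s multimodal models are not assumed to share this API, and the fundus_photo.png file is hypothetical.

```python
# A minimal sketch of multimodal question answering over a medical image.
# The general-domain VQA model below is a stand-in, not an ophthalmology-
# specific model; a real clinical tool would be trained on eye imagery.
from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("fundus_photo.png")  # hypothetical retinal photograph on disk
result = vqa(image=image, question="Are there signs of diabetic retinopathy?")
print(result[0]["answer"], result[0]["score"])
```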

Chen’s team has also created LMOD, a large dataset designed specifically for testing these LLMs. LMOD is the first multimodal ophthalmology benchmark, comprising 21,993 instances that span several ophthalmic imaging modalities; free-text, demographic, and disease biomarker information; and such primary ophthalmology-specific applications as anatomical information understanding, disease diagnosis, and subgroup analysis. The team’s research has identified shortcomings in how these models interpret clinical imagery and offered insights into ways to improve their reliability.

As scientists continue to troubleshoot these models, Chen is excited about their potential to boost accessibility to eye care. “AI assistants could be useful at community or primary care centers in areas with limited access to specialists,” he says. “In the future, these centers can deploy these AI-assisted techniques to reduce health disparities.”

Meanwhile, at the Cardiovascular Data Science (CarDS) Lab, researchers are training LLMs to interpret diagnostic cardiac imaging. ECG-GPT, for example, is a novel model that analyzes images from electrocardiograms and generates full text reports. The tool is currently available for use on the CarDS Lab website.
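Architecturally, this kind of tool pairs a vision encoder with a text decoder. The sketch below shows that general pattern using an open general-domain captioning model as a stand-in; ECG-GPT’s actual architecture and weights are not assumed here, and the ecg_scan.png file is hypothetical.

```python
# A schematic of image-to-report generation: a vision encoder paired with a
# text decoder. The general-domain captioning model below is a stand-in for
# a clinically trained model like ECG-GPT, not ECG-GPT itself.
from transformers import pipeline

report_generator = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

result = report_generator("ecg_scan.png")  # hypothetical ECG image file
print(result[0]["generated_text"])
```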

“These projects are meant to enhance the scalability of diagnostic services by making workflows easier at our very overloaded health systems and diagnostic environments,” says Rohan Khera, MD, MS, assistant professor of medicine (cardiovascular medicine) and of biostatistics (health informatics). “Often, the bottleneck is the expertise in interpreting [cardiac diagnostic tests]. This might just change how we deliver care.”

Khera is similarly optimistic about the potential of these LLMs to bring diagnostic testing to areas that lack adequate medical resources. In parts of the world with very few experts, waiting lists to see a cardiologist who can interpret cardiac tests can be long, greatly prolonging the time it takes to receive necessary treatment. ECG-GPT, for instance, could produce test reports rapidly, which a specialist could then confirm before initiating treatment.

Khera’s team is launching multiple studies that evaluate the performance of their models in underserved clinical settings around the world. “We have the opportunity to democratize high-quality care,” says Khera. “You wouldn’t have to be near the best doctors—we can create algorithms that can transform care for the majority of the population.”
