Pathology Grand Rounds, November 7, 2024
November 08, 2024. Adam Rodman, MD, MPH, FACP, of Harvard Medical School and Beth Israel Deaconess Medical Center, presents "Towards an AI Second Opinion: Clinical Reasoning Large Language Models and How We Might Make Humans a Little Better."
Transcript
- 00:00 Hi, everybody. Welcome to Pathology Grand Rounds. I'm Rob Homer. I am exceptionally pleased to introduce our speaker. Adam Rodman has a dual degree in medicine and public health from Tulane, trained in internal medicine at Oregon Health and Science University, and was a global health fellow at the Beth Israel Deaconess, where he actually wound up working in Botswana for a couple of years and wound up developing a curriculum for interns in Botswana, which is great. He's currently assistant professor of medicine at Harvard Medical School and attending internist at Beth Israel Deaconess.
- 00:38 So I became familiar with Adam from his well-known podcast, Bedside Rounds, which is phenomenal because it goes into the history of medicine. And I have to say, if you ever look around and ask, why are things like this? It turns out there frequently is a reason, and it's frequently found in history. And it was like, oh, that's why we do this. So it really provides a lot of insight, and I strongly recommend it. Also, I have to say, at dinner last night, I found Adam to be just pure enthusiasm. If we could just harness that, we don't need nuclear. We don't need solar. We just need Adam. It's phenomenal.
- 01:14 And given his history side, he transformed from history to AI. How did that happen? Because there's a history of machine learning in medicine, which goes back a long way. When I decided to go to medical school, back when I was in college, one of my classmates said, why are you doing this? Computers are gonna take this over any day now. And that was in nineteen seventy nine. So maybe it's true; it just took a little while.
- 01:39 His research, funded by the Macy Foundation, the Gordon and Betty Moore Foundation, and the National Academy of Medicine, explores the integration of large language models into complex diagnostic challenges, which you're gonna hear about today. He's an associate editor for NEJM AI and has published work in JAMA Network Open, JAMA Internal Medicine, and the New England Journal of Medicine, among other journals.
- 01:59 And, actually, the rationale for him really being here: he's leading digital innovation in education in medicine. He codirects the iMED program, which is dedicated to exploring digital education. Among other things, he runs a digital education track for future educators to learn about electronic education technology. He's currently leading a task force to integrate AI into Harvard Medical School undergraduate medical education. He has received numerous teaching and mentorship awards, including most recently the Herman Blumgart faculty teaching award.
- 02:32 Join me in welcoming Adam. I'm really excited to hear this. This is gonna be really cool.
- 02:36 Thank you. Hi, everybody. I'm Adam.
- 02:41 Okay. Can you guys hear me? Oh, good. I'm a pacer. Before I start, oh, I'm so sorry. If you're an internal medicine physician, could you raise your hand? I had a feeling I was gonna pick on you guys. I would like, actually, so: at about twenty minutes in, I'm going to demonstrate our actual clinical workflows. If one of the internists would be willing, it's not a competition, but to actually use the AI on one of the real patients from our studies, would you be willing to?
- 03:11Yeah. Okay.
- 03:12 So hi. My name is Adam. I'm not a pathologist, so I apologize in advance. I don't know who's nerdier, but in general, internists are some of the nerdiest physicians alive. I got a nod over there. So
- 03:26 I research how people think. And in pathology, when it comes to machine learning in particular, there's really been a focus, unfairly, I think, on, like, classification models. Right? Just, what is this image? Which I think, well, not I think, I know, underplays the actual cognitive processes that go into pathology, which is why everyone here still has a job and is in no danger of losing a job anytime soon. Now, what I'm going to talk about today is my research, but I'm gonna try to make it entertaining: language models, reasoning, multimodal models, and what the data show. I was talking to my PhD student this morning; I have updated these slides with data that has not been published yet, that was run yesterday. So this is literally the latest breaking data here, to give you an idea of where we've come already and where we're gonna go. So
I wanna start with this idea that is relevant to basically every medical specialty, and it's: what is diagnosis? What does it mean to make a diagnosis? This is what we all do. This is theoretically what internists do, though often we don't get a definitive diagnosis. This is what we do in pathology. Basically all of medicine, well, not all of medicine, but much of it is focused on this idea of diagnosis. Even when we're talking downstream about management, it usually relies on the diagnostic process.
- 04:47 But what does it mean to make a diagnosis? So there's a great paper. I always say it was just published; it was actually published about fifteen years ago. I'm just getting old and unstuck in time. But there's a great paper that looks at, like, what do we mean when we say diagnosis? What do we mean when we say clinical reasoning? And unfortunately, as I'm sure everybody here knows, there's not a standard definition, and everybody is talking about something a little bit different, which means we have a tendency to talk past each other. So to give you a sense of where the science stands: it's complicated.
- 05:18 Yeah, I'm gonna do some Simpsons stuff in here. I realize that as a geriatric millennial, that dates me quite considerably. But at its most basic, dating back to the ancients, diagnosis is a classification task. Like, what is nosology? Nosology is the way that we categorize diseases. And fundamentally, making a diagnosis means, okay, I have to come up with a single disease, or multiple diseases in my differential, out of this large classification schema. Well, that sounds easy. I think what's funny is, if you go back to some of the early literature, like this from the fifties on diagnosis, they literally thought it was this easy. Everybody here knows it's much more complicated than that.
- 05:57 So modern researchers into the nature of diagnosis have focused more on human psychology. So, RIP Danny Kahneman. If anybody has read Thinking, Fast and Slow lately, I'm sorry, I'm gonna very quickly summarize it. I will also say I actually reread Thinking, Fast and Slow just, like, two months ago. I don't recommend reading it. Danny Kahneman, brilliant person, very droll writer. Again, RIP Danny Kahneman. So our understanding of modern diagnosis is, you know, based on cognitive psychology: this idea that most of the process is this very fast, very contextual, automatic system one process. So system one, very classically, a great example is you're driving on the highway going home. All of a sudden you stop paying attention. Fifteen minutes go by. You're still on the highway. Everything's going well. What happened? Your brain went into these very automatic thought processes, and that's our classic system one. We talk about heuristics or mental shortcuts. Often system one gets maligned, right? We spend a lot of time talking about how system one can lead us wrong. But the fact of the matter is system one exists. We evolved this way for a reason. It's fast, it's efficient, and it works pretty well. System two is the very slow, very contextual, very labor-intensive, and theoretically less biased (not in reality, but theoretically less biased) kind of formal thought process.
- 07:21 So what does this look like in process? Don't worry, I'm not gonna go over this. This is from Pat Croskerry, who I've talked about a bit over the last day. So this is Pat Croskerry's grand unified theory of clinical reasoning. Again, don't worry about it, but the gist is we cross over between system one and two. What does this look like in practice? So I think probably everybody has read or is aware of Judith Bowen's very famous paper from the early two thousands, which is based on the work of Bordage about a decade before. But our current understanding, the cognitive psychological understanding of the diagnostic process, is something called script theory. And this is a knowledge encoding and knowledge activation theory.
- 07:59 So script theory, how does it work? The idea is that we know things about our patients. We know things about pathology from a variety of sources: medical school, of course, reading journal articles, and more importantly, from seeing patients and from learning from our practice, where we get feedback. And all that knowledge gets encoded in our brains, but we are not a library. We are not the Dewey decimal system. I do not say, heart failure with reduced ejection fraction is coded at E23.05. That's not the Dewey decimal system; that's like another classification system that I forget. I spend too much time in academic libraries. But no, how does that information get encoded? It gets encoded in these things that, in medical education and the psychology world, we call scripts. Scripts are a psychological principle actually from the nineteen seventies. It comes from this idea that there are stereotyped patterns of behavior. So for example, you walk into a restaurant, you have a script that you follow, and when you deviate from that script, it freaks people out. And if, for example, you were to go to your friend's house for dinner and you tried to follow the restaurant script, it would be incredibly rude.
- 09:02 So, you know, in the early nineties, psychologists who study clinical reasoning realized we're doing the same thing with diseases. So the idea here is that when we activate information about a disease, what we're really doing is telling a story to ourselves. And that story has, obviously, the presentation. That story has the pathological diagnosis, what that might look like. It has, you know, the treatments; all of that is organized together. And then, more importantly, scripts do not exist in isolation. It's not like a library. They exist in these parallel systems of networks called schemas, where, almost by definition, if you activate one script, you're excluding another. And in medical education, we spend a lot of time talking to our students about semantic qualifiers. You know, is this acute, subacute, chronic, polyarticular, monoarticular? And the reason, the psychological reason, this is right out of Bordage, is that fundamentally those are the ways that we include or exclude different diagnoses.
- 10:06 This goes back to a really interesting study by Arthur Elstein in the late seventies, where he took a bunch of medical students at Harvard, took a bunch of attendings, and then had them think out loud about what was going on. And as he expected, as the dominant theory held, everyone did the Sherlock Holmes thing. Everyone was like, okay, this is my theory, these are the next tests: the hypothetico-deductive process. But what else he noted is, okay, the attendings only asked five questions and they got the diagnosis right. The medical students asked thirty questions and they were far less accurate. So there's something else going on than simply following a hypothetico-deductive process. So what does this look like in practice?
- 10:47 So I'm in clinic. A patient comes up to me. She says, what did she say? I can't breathe. It started yesterday. Everyone has had this experience. I mean, it's a little bit different with pathology, but you look at something and then automatically your mind starts to sort that information. What is happening? You are activating one of these scripts. In medicine, we talk about a problem representation, the instantiation of a problem representation, which in this case is acute dyspnea. This is a way of teaching, but it's activating that part of my mind. This idea that if somebody tells me, all of a sudden I can't breathe, versus six months of slowly progressive shortness of breath, fundamentally activates different ways that I think about information. And everyone has had this experience: without thinking, you know what questions to ask next. Why? Because they're going into your script for acute dyspnea. You are going through that schema, and you're asking follow-up questions. So I hypothesis-test COPD. I hypothesis-test pneumonia, and based on that, I further refine my problem representation until I get something that is reasonably close to the diagnosis, or I find something that doesn't make sense and I cross over to other types of metacognitive processes.
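[Editor's note: to make the script idea concrete, here is a minimal illustrative sketch, not from the talk; the diseases, qualifiers, and structure are invented toy examples.]

```python
# Toy illustration of illness scripts filtered by semantic qualifiers.
# Each script lists the qualifiers a disease typically matches; activating
# a problem representation includes some scripts and excludes the rest.
SCRIPTS = {
    "pulmonary embolism": {"onset": "acute"},
    "COPD exacerbation":  {"onset": "acute"},
    "pulmonary fibrosis": {"onset": "chronic"},
}

def activate(problem_representation):
    """Return the scripts compatible with every semantic qualifier."""
    return [
        disease
        for disease, script in SCRIPTS.items()
        if all(script.get(key) == value
               for key, value in problem_representation.items())
    ]

# "I can't breathe, it started yesterday" -> the acute-dyspnea schema.
print(activate({"onset": "acute"}))  # ['pulmonary embolism', 'COPD exacerbation']
```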
- 11:56 Now, I said that I'm going
- 11:57to go over some of
- 11:58the, like,
- 11:59different understandings of clinical reasoning
- 12:01because I will say that
- 12:02until,
- 12:03let's say, about twenty ten,
- 12:05we had basically agreed in
- 12:07the reasoning field that script
- 12:09theory was it.
- 12:10I think that
- 12:12it doesn't take much to
- 12:13think about the problems with
- 12:15script theory and that it's
- 12:16not a necessary or sufficient
- 12:18way to think about how
- 12:19we make decisions. So my
classic example: right now I've had about a half gallon of coffee. So let's say I'm admitting patients: I'm well rested, well caffeinated, it's ten AM. My mind is gonna work differently than when I'm working at two AM, my pager going off every five minutes, and my Epic chat just going red over and over and over again. Our minds, I mean, they don't work like that.
- 12:40 There are a lot of ecological factors that affect our reasoning process. And ecological psychology is one of the frames now that we think in, which is that reasoning is not something that just happens up here. Reasoning is something that happens in our bodies, happens in our environment, and these environmental factors play a large part. Again, I don't think any of this is controversial. I will put up this idea of situated cognition. So to say that something is situated is to say that it's contingent; these are all psychological terms. But the idea here is that there's no such thing as a neutral decision. They're all shaped by our prejudices, by our life experiences. And because of that, decision making, per situated cognition, is individualized.
- 13:22 I don't find this very helpful, like, from somebody who wants to, you know, make people better. I don't find this a very helpful psychological theory, but it is one that is out there that you should know about. And then one of the big things that I think is especially relevant in pathology is when I think about how I make diagnoses in the real world. Also, you can tell this is an AI-generated image, because a lot of these have, like, weird demon faces, as the model gets confused and tries to average Simpsons characters and other things.
- 13:46 But, you know, I might make a diagnosis, and I'll pat myself on the back that maybe I've diagnosed lupus nephritis in a new case of lupus. But you know what? What I'm doing is looking at a creatinine trend that was identified by the patient's PCP and nephrologist as an outpatient, and I might be looking at an ANA or a double-stranded DNA that was sent much later, and all of this is mediated through the electronic health record. So the reality of diagnosis: it's really collaborative, right? Not in the sense that we're in the same room working together, but most diagnosis, I don't know about most, I don't know the percentage, but a lot of diagnosis in the year twenty twenty four is mediated through the electronic health record, happening with different people at different places in time.
- 14:25 Okay. So why do we care so much about the cognitive processes of reasoning? Like, why does anybody give me money to study this? Which I wonder myself sometimes. For the last, let's say, twenty-five years, the focus has been on this. Ever since To Err Is Human came out, there has been a huge focus on medical errors and the burden of medical errors. This is from David Newman-Toker's most recent paper. I actually think the more damning thing than what he pulled out is that he says fifteen diseases account for half of serious harms. Four diseases accounted for thirty-nine percent of harms in this study, and those diseases were such exotic things as stroke and heart attacks. So doctors make misdiagnoses that cause harm all the time. We're not missing lupus nephritis. Well, we are, but we're also missing, like, heart attack and stroke.
- 15:11 This is from Andy Auerbach's group. Fantastic study that looked at very specific cognitive errors. Because, you know, one of the challenges has always been: yeah, there are diagnostic errors, of course there are. But is that because of our human brains, or is that because of other systems factors? I mean, we all know that getting outside records can be difficult. There are all sorts of interruptions. So maybe it's just our fragmented health system that's doing this. And what Andy Auerbach did is looked at fourteen different hospitals across the US: patients who were admitted to the hospital, appropriately triaged, meaning that they didn't decompensate within the first forty-eight hours, and then either went to the ICU or died. And in those patients, a quarter of them had a diagnostic error, and a lot of these were severe. One fifth of those errors caused severe harm or death. And when he looked at the reasons, the most common cause was errors in human cognition, by far, almost half, with the second biggest being errors in test interpretation and ordering, which is also a human cognitive error. So systems errors were actually much less common. Mind you, this is a very specific situation, patients in the hospital who are decompensating, so I don't know that that's generalizable. But in this population, cognitive errors are common.
- 16:21errors are common.
- 16:23So
- 16:24in my field,
- 16:25which I guess is medical
- 16:26education, reasoning has usually been
- 16:27within the med ed field,
- 16:29but we've looked historically at
- 16:31ways that we can make
- 16:32people better.
- 16:33Three big categories. So first,
- 16:36everyone here who is my
- 16:38age or less,
- 16:40I'm forty,
- 16:41has probably
- 16:42received some education about cognitive
- 16:44biases. Like, actually, raise your
- 16:46hand if you've ever been
- 16:46taught about cognitive biases.
- 16:49Yeah. I would expect to
- 16:50see half or more. So
- 16:52cognitive biases, of course, this
- 16:53is like your anchoring bias.
- 16:54This is recency bias.
- 16:56This has been a major
- 16:57trend in medical education over
- 16:58the last fifteen, twenty years.
- 17:00The problem is that when
- 17:01we study these things experimentally,
- 17:03teaching people about cognitive biases
- 17:05does not make them more
- 17:06less likely to have cognitive
- 17:08biases. I can teach you
- 17:09all about anchoring bias. You
- 17:11are still going to be
- 17:12anchor. You you are still
- 17:13going to anchor. You may
- 17:14have the metacognition that you're
- 17:15doing that, but you can't
- 17:17stop. And why? Like, sometimes
- 17:19I'm pretty amazed. Like, look
- 17:20where we are. Look at
- 17:21this technology. I am like
- 17:23a relatively hairless ape that
- 17:26evolved over fifty million years
- 17:27ago on the plains of
- 17:28Africa. Our brains are not
- 17:29evolved to work in this
- 17:31highly technical world. So it
- 17:33is no not surprising that
- 17:34there are cognitive biases. These
- 17:35are the shortcuts that we
- 17:37evolved with.
- 17:38It it's just part of
- 17:39being human, so that doesn't
- 17:41work. Number two is education
- 17:43about debiasing strategies. This is
- 17:45very popular and kind of
- 17:47optimistic because it works. So
- 17:48if you think about deliberative
- 17:50practice, the literature on deliberative
- 17:51practice,
- 17:52you know, this would be
- 17:53the Malcolm Gladwell ten thousand
- 17:55hours,
- 17:57even though that has been
- 17:59somewhat debunked.
- 18:00But, like, those psychological studies,
- 18:03if you were to spend
- 18:04five hours every single week
- 18:06going over
- 18:07every single pathologic diagnosis you
- 18:09made, reflecting on what you
- 18:10could have done better, seeing
- 18:11how the patients did, you
- 18:13will get better at the
- 18:14diagnostic process. This has been
- 18:16studied in internists.
- 18:18It works.
- 18:19No one has five hours.
- 18:21And if you were to
- 18:21spend five hours doing that,
- 18:23your bosses would get mad
- 18:24at you for not closing
- 18:25out enough cases or for
- 18:26me, like, my my length
- 18:27of stay would go up.
- 18:28Like, there's our system does
- 18:30not allow for these debiasing
- 18:32strategies as effective as they
- 18:33are. I mean, to this
- 18:34day, I keep a follow-up
- 18:35list. I have ten patients
- 18:36on it because that's all
- 18:37I can put because I
- 18:38don't have time. And whenever
- 18:39I put a new patient
- 18:40on, I have to take
- 18:40an old patient off. I
- 18:41spend maybe thirty minutes a
- 18:43week. I don't even know
- 18:44if it makes me that
- 18:44much better, but at least
- 18:46I try. And even then,
- 18:47it's really hard to do
- 18:49that. So that brings us
- 18:50to number three,
- 18:52which, doctor Homer mentioned, AI.
- 18:54We've been talking about AI
- 18:55for a long time.
- 18:57So artificial intelligence has been
- 18:59the third historical and actually
- 19:01the oldest way to make
- 19:02humans better. So this is
- 19:04where,
- 19:06doctor Homer was talking about
- 19:07nineteen seventy nine. I assume
- 19:09they were talking about internist
- 19:10one probably,
- 19:11but we've been talking about
- 19:12this for over a century.
- 19:14So this is the oldest
- 19:15quote I've ever found about
- 19:17artificial intelligence in medicine from
- 19:18Bernard Shaw. So talking about
- 19:20reflecting on how, like, no
- 19:22one does math anymore in
- 19:23in bit, like, old fashioned
- 19:24spreadsheets, paper where we had
- 19:25to do all this math.
- 19:26By the early twentieth century,
- 19:27there were counting machines that
- 19:28did all that. And he
- 19:29says, in the clinics and
- 19:30hospitals of the near future,
- 19:31when they quite reasonably expect
- 19:33that the doctors will delegate
- 19:34all the preliminary work of
- 19:35diagnosis to machine operators.
- 19:38And then he says, the
- 19:38observation of the symptoms is
- 19:40extremely fallible, depending not only
- 19:42on the personal condition of
- 19:43the doctor who has possibly
- 19:44been dragged to the case
- 19:45by his night bill after
- 19:46an exhausting day. I love
- 19:47that that also hasn't changed
- 19:49in over a century. But
- 19:50upon the replies of the
- 19:51patient to questions which are
- 19:52not always properly understood and
- 19:54for lack of the necessary
- 19:55verbal skill could not be
- 19:56properly answered if they were.
- 19:58From such sources of error,
- 19:59machinery is free. So when
- 20:01I talk about artificial intelligence,
- 20:02we are not talking about
- 20:03anything new. These ideas have
- 20:05been around for a long
- 20:06time. And depending on how
- 20:08you define a diagnostic AI,
- 20:11the first artificial intelligence in
- 20:12medicine was made about six
- 20:14months after the first electronic
- 20:15computer was made. So we've
- 20:17been working on this for
- 20:18a long time.
- 20:19 Historically, AI clinical decision support has had really big impacts, though in limited domains. One of my favorite examples is AAPhelp. This is sometimes called the Leeds abdominal pain system; it's for patients presenting with an acute abdomen. A multicenter RCT showed a mortality decrease in acute abdomens of twenty-two percent. That is huge. It's a multicenter RCT. It was so important that the US Navy considered it an essential part of our nuclear shield, because they put it on all the submarines with ICBMs. Because if a seaman had an acute abdomen, you needed to know whether to bring that submarine back and evacuate that seaman, in which case part of your nuclear umbrella went away. And then, of course, there's INTERNIST-1. I don't know if anyone here knew Jack Myers, but INTERNIST-1 is one of the coolest AI systems of all time. It was modeled on the brain of Jack Myers, called, by the people who knew him, Black Jack. He is the reason we no longer have oral boards in medicine. He had an eidetic memory and was chair of medicine at Pitt. And an AI system that was based on his brain could, by the year nineteen eighty-two, solve the New England Journal of Medicine clinicopathological conference cases better than any human could. So, not new technologies.
- 21:24 Now, one of the sad things is that the technology really stagnated. It worked really well, a lot of these computational strategies worked really well in narrow categories, but from roughly nineteen ninety to the year twenty twelve, there wasn't really any change in the performance. There's actually a review from twenty twelve that shows that differential generators in twenty twelve worked exactly as well as those that had come twenty years before. So, real stagnation of the computational techniques.
- 21:50 Now, language models are the most exciting technology in the clinical reasoning space to come out in forty years, basically since the early nineteen eighties. I'm going to briefly explain what they are, how they work, and why they might be exciting before actually demonstrating, like, how it works on a real case. So, a language model. I should ask: who here has not used a language model before? Oh, a couple people. So most everyone, I imagine, has at least put something into ChatGPT before? Yeah, I think they're pretty ubiquitous. I believe ChatGPT was the most downloaded app in history. So, presumably, most people have. So how does a language model work? It's a type of neural network; it's actually a transformer. But, effectively, it's autopredict on steroids. So, I'm sure everyone has done this: you go to Google, you type in "pathologists are", and it predicts a bunch of different words. If I type in "internal medicine physicians are", it'll probably say, like, nerdy, or awkward, or maybe it'll say smart, but it has a lot of predictions for what the next word is, based on the tens of thousands of searches that have come before.
- 22:58 A language model takes that fundamental technology and puts it on steroids. So it takes, well, not all of human text, but virtually all of human text. We don't know what's in the training corpus of the latest foundation models, but we do know for GPT-3-class models: it's all of the Internet, a lot of pirated books, a lot of human textual material. We also know from news reports that these companies are doing things like scraping YouTube videos, scraping podcasts. They're hungry for textual data. So this is a script, by the way. Huge nerd here, internal medicine physician, shocker. This is a script from Star Trek II: The Wrath of Khan, which is because I'm a Trekkie, but it's also because GPT-2 had a very famous experiment where you trained it only on the scripts from Star Trek, because, again, the people who built these models were nerds, and someone actually built a dataset only of Star Trek scripts, which I love.
- 23:44 So it breaks every single word, every single piece of text, into units called tokens. A token is a piece of a word. It is the basic semantic unit of a language model. And just like in the example where you type in "pathologists are" and it predicts the next word, it predicts the next token based on this huge training set of information. What makes it a large language model is that it's based on a technology called a transformer, which doesn't just predict the next token; it predicts the next token in a vector string of tokens, which means, semantically, it's predicting the next word in the context of a sentence, in the context of a paragraph, in the context of an entire book. And because these are so large and computationally intensive, it's doing these calculations, you know, you go to Google, you type it once, that's one, it's doing that across four hundred billion parameters. So, massive amounts of calculations to figure out what the next token in a string should be.
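[Editor's note: a minimal toy sketch, not from the talk, of next-token prediction; the two-token context and its probabilities are invented, and a real transformer computes these scores across billions of parameters rather than looking them up in a table.]

```python
import random

# Toy autopredict: map a short context to candidate next tokens with
# probabilities, the way search completion ranks next words. Numbers invented.
NEXT_TOKEN_PROBS = {
    ("pathologists", "are"): {"meticulous": 0.40, "nerdy": 0.35, "smart": 0.25},
}

def sample_next(context):
    """Sample the next token from the distribution for the recent context."""
    probs = NEXT_TOKEN_PROBS[tuple(context[-2:])]
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next(["pathologists", "are"]))  # e.g. 'nerdy'
```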
- 24:42 And now, well, I'm losing my voice. I guess I'm not surprised; I've been talking nonstop for the last twenty-four hours. As someone who studies human reasoning: humans are both, like, wonderful creatures, and we're terrible sometimes. Right? So we wrote the Declaration of the Rights of Man. We wrote the Torah. We wrote the Bhagavad Gita. We also wrote Mein Kampf and the website 4chan, and all of those are encoded in language models. So the final step that makes these models (there is a pretraining step that I won't go into), the final step that makes them so eerily human, is reinforcement learning from human feedback. Sometimes it's called fine-tuning; there are many ways to fine-tune, though. But the idea here is that a human being actually sits down and talks to a language model, and, oh, it's a Skinner box, right? You did good, language model, you get a cookie; you did bad, you get an electric shock. It's a one or a zero, but same idea.
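[Editor's note: a toy sketch, not from the talk, of the binary-reward idea behind RLHF. Real systems fit a reward model and update the LLM with policy-gradient methods; here the "model" is just a preference table with invented entries.]

```python
# Toy binary-feedback loop in the spirit of RLHF: a human scores an output
# 1 (cookie) or 0 (shock), and the model's preference shifts accordingly.
preferences = {"polite, helpful reply": 0.5, "4chan-style reply": 0.5}

def give_feedback(completion, reward, lr=0.5):
    """Nudge a completion's preference toward the human's 1-or-0 reward."""
    preferences[completion] += lr * (reward - preferences[completion])

give_feedback("polite, helpful reply", 1)  # good model, have a cookie
give_feedback("4chan-style reply", 0)      # bad model, electric shock

print(max(preferences, key=preferences.get))  # 'polite, helpful reply'
```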
- 25:31 And then, over time, through all these training cycles, it makes them remarkably human. So, a fun fact about the word "delve": everybody knows that ChatGPT loves to say, let's delve into that. Right? And there was a great study that was published, well, about a year ago, that showed that in the scientific literature, the word "delve" was almost never used, and now it's used all the time, because everyone's using ChatGPT to help write their papers. Well, why is "delve" in there? It turns out that OpenAI used contract workers in Nigeria and Kenya to do most of their RLHF. And "delve" sounds weird in American English, but in Kenyan and Nigerian English, "delve" is commonly used; it doesn't sound weird to them. So now "delve" is a big part of American scientific literature because of the RLHF of a language model by contract workers in Kenya and Nigeria.
- 26:15So just one of those
- 26:16fun things. And, of course,
what you get is something that is remarkably human. You know, I tell ChatGPT I'm doing community theater, I wanna do The Wrath of Khan. And it says: farewell, noble admiral, hold not your breath; the Enterprise cannot save, she's marked for death; from the sky, soon shall be torn asunder, a fiery end, a final echoing thunder. That's Shakespeare, right? So he still goes, "KHAAAN!"
- 26:36 So, I did just do this so I could do this meme. You're welcome. But also to point out that the technology that is underneath this, a transformer, can be used to generate any kind of human creative output. This is just DALL-E 3: it's Captain Kirk at the Globe Theatre. DALL-E 3 is what's called a diffusion model. There are now video models; if anyone has seen Sora, that's the OpenAI video model. There's this crazy thing, like, there are video games that are diffusion video games, meaning that you, like, walk through a video game, and it renders every single scene from an AI, like, hallucination. And then, of course, most concerningly, you can clone people's voices really effectively now, with about ninety seconds of audio, and you can sound like anybody. And as someone who has, like, thirty hours of podcasting out there, someone could easily clone me. So if I call you up, don't trust that it's me, especially if I'm trying to get your Social Security number.
- 27:24 Okay. Why does this matter for diagnosis? Because that's cool, right? I've told a couple people this: I got into all of this research because I was working on my second book, which was about clinical reasoning. I talked to a bunch of data scientists, and I had used GPT-3 before, like, before ChatGPT was released. And I thought it was stupid, and it wasn't gonna accomplish anything, and I was not impressed at all. So, with that caveat, maybe you shouldn't listen to anything that I say. But, like, why do language models appear to work in diagnosis? So this is a real case of mine. This is a patient in whom I made a diagnostic error.
- 27:58 The patient died. And what I will say is, I am friends with his wife; I have permission from his family. He was also a huge Trekkie. So this patient had night sweats, monocytosis, a daily fever, ground-glass opacities on the X-ray. He had been treated for bladder cancer, like, four or five years before, with BCG. He did not improve with antibiotics, which is when I met him in the hospital. He still had the fevers despite high-dose antibiotics. Liver enzymes were crazy. And then he had spinal hardware in his back from previous surgeries. And, about two weeks before GPT-4 was released, I got back from the state lab what he had actually died from. So this poor man, we thought he had culture-negative endocarditis. He died from a freak case of M. bovis bacteremia, presumably reactivated from his BCG treatment. Quite rare, but a misdiagnosis.
- 28:46 It wasn't just me. Like, I was the hospitalist with the residents; there were a lot of doctors involved. But, you know, one of the first things that I did when GPT-4 came around is I asked it for a second opinion based on my problem representation at the time. And what happens is that it says, you know, the number one diagnosis was what the patient died from, and the number two diagnosis was what I was wrong about. So, you know, this is me in twenty twelve, and my first thought is: what if I had had this technology six months ago? Would this have changed anything?
- 29:17 And we actually don't know the answer, because second opinions, there are not great studies of second opinions, at least in internal medicine. We do know that there are discrepancies, and they can be quite large; these are pathological diagnoses, so these are actually pathology studies. There's only one good prospective study on this, from the Netherlands, which did find, like, frequent switching of diagnoses and improved symptoms. The patients actually did better when they got a second opinion. But, you know, the data is pretty early. So how do language models presumably do a good job at diagnosis?
- 29:51 Well, we've done some cool, like, ablation studies. My job is really cool; I basically get to do psychology on both humans and machines. We can do ablation studies where we knock out parts of the language model to figure out what's going on. And what appears to happen is that the reason language models can make diagnoses is their similarity to how human brains make diagnoses. Right? So if you think about token prediction, if you think about the log probs of different words, that's what a script is. So LLMs are basically system one on steroids, and they encode far more knowledge than we do. They are not perfect, but it does appear that this is what gives them their remarkable abilities in diagnosis.
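[Editor's note: a toy sketch, not from the talk, of the log-prob-as-script analogy: ranking a model's candidate continuations of a problem representation behaves like a weighted differential. The prompt and probabilities are invented.]

```python
import math

# Invented next-token probabilities for the prompt
# "Acute dyspnea, fever, tachycardia. Most likely diagnosis:"
candidates = {"pulmonary embolism": 0.45, "pneumonia": 0.35, "pericarditis": 0.20}

# Sorting diagnoses by log probability is the machine analogue of an
# activated illness script: a weighted differential, most likely first.
differential = sorted(candidates, key=lambda d: math.log(candidates[d]),
                      reverse=True)
print(differential)  # ['pulmonary embolism', 'pneumonia', 'pericarditis']
```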
- 30:31 So with that, wait, who is my volunteer internist for this? Yeah. That's it. You work on Yelp?
- 30:40 So I am going to show an example of how this technology works. Oh, jeez. So, no. No. Five years. I'm sorry. No. So I'll just explain what we're doing. So right
now, if you go to the BI and you get admitted to the emergency room and you meet certain inclusion and exclusion criteria, you will be pulled into our data pipeline, to study the effect of second opinions at different, what we call, diagnostic touch points. Because one of the ideas is, when you evaluate these things, it depends on the information density. Right? In real diagnosis, you don't get a case vignette. You're often operating in poor information settings. So what we are doing here is, I am going to walk you through, this is all real material, I've stripped it of PHI, this is a real patient.
- 31:28 Just so you know, I intentionally just chose a random patient. This is not a zebra. Okay? This is not like a CPC. So what I want you to do: this is the information that was available to the poor emergency room resident, and I want you to tell me what your thinking is, what you would wanna do, and then I'm going to ask the model, and you can tell me how that changes your thinking.
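[Editor's note: a hypothetical sketch, not the actual study pipeline, of what one diagnostic touch-point record might look like before being sent to a model for a second opinion; every field name here is invented.]

```python
import json

# Hypothetical touch-point record: the information available at ED triage.
# Later touch points would append the resident's note, imaging, and labs.
touch_point = {
    "touch_point": "ED triage",
    "chief_complaint": "chest pain, tachycardia",
    "triage_history": ("new PE and left lower extremity DVT diagnosed 5 days "
                       "ago; on apixaban 10 mg BID; now worsening chest pain, "
                       "cough, tachycardia"),
    "vitals": {"temp_f": 101, "hr": 140, "rr": 26},
}

# Serialize the record into a prompt asking for a second opinion.
prompt = ("Given the encounter information so far, give a ranked "
          "differential and suggested workup:\n"
          + json.dumps(touch_point, indent=2))
print(prompt)
```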
- 31:49 Okay. So I'll read it out loud, even though I'm losing my voice. So this is a young woman who walks into our emergency room. This is the ED triage note. So this history is taken not in the emergency room but in the waiting room, and a nurse has taken it. The chief complaint, they have to put in an ICD code, so this is chest pain, tachycardia. The triage history: patient reports a new PE diagnosis and left lower extremity DVT at an outside hospital five days ago. Put on Eliquis ten BID as of four days ago. Patient now arrives here with worsening chest pain, cough, and tachycardia. The nurse appropriately put this at the highest severity level, one, which means that she goes back immediately and the doctor sees her. And vitals: fever, one-oh-one. Heart rate, one forty. Respiratory rate, twenty-six. Blood pressure and O2 sats are fine.
- 32:35K.
- 32:38 So you want me to... Yeah. Just say what you would do.
- 32:42 Okay. So, let's get a chest X-ray. Let's get an... let's get a chest X-ray. Let's get...
- 32:56 I'm worried she still has a PE. Yeah. I'm worried she has either, I'm worried she has a new infection now for some reason, and I'm worried she has, like, a dissection possibly.
- 33:07 Correct. So the top, your top worries are a new or worsening PE, infection, something like an aortic dissection.
- 33:20 Me and Macs don't get along.
- 33:23Okay.
- 33:25 So, one of the questions that I always get when people come to my lab is, can I see the AI? Because they think there's something really cool when you see an AI. The reality is that it's an Excel spreadsheet. It's a JSON database. So it's not exciting. So what I'm doing here, this is the model I showed you earlier, this is not the model we're using. In reality, we're using a Llama 3 model. This is o1, which is the latest and greatest model from OpenAI, the one that really freaks me out. But we'll see what it's going to show. And to be clear, I haven't run this through o1 yet, so I don't know what it's going to show.
- 33:55 Okay. So, wow, it's going fast. It agrees with you that number one on the differential is a recurrent or worsening pulmonary embolism. Infection is number two; it's picking up on that fever, could this be a pneumonia? Number three, it's now also considering pericarditis. Could pericarditis be going on? Fair enough. ACS, it's considering. Again, infection, and then subtherapeutic anticoagulation, pneumothorax, these are all very unlikely, or COVID-19 infection. And it also wants very similar things to you. It wants an EKG, a chest X-ray. It wants a repeat CTA. This is its management plan. What, does any of this change your thinking?
- 34:35No.
- 34:37Is it helpful?
- 34:39 It's okay to say no. I mean, it just kind of confirms what I was thinking anyway, I guess. Yeah. So it's confirmatory. It makes you more confident in what you were thinking.
- 34:51 Okay. On to the next aliquot. So this is very unique to our workflow. What happens next is the poor ED resident, who presumably has, like, twenty other patients, has to come immediately to see this patient, because it's an ESI of one. The ED resident, she writes: patient is a young female presenting to the ED with chest pain and worsening shortness of breath. Patient notes that she has a history of lupus, was diagnosed with a PE five days ago after having a CT for shortness of breath and chest pain, started on Eliquis. I think all of this is the same. Today she had worsening shortness of breath, chest pain, worsening palpitations, which prompted her to present to the ED. And then she was triaged; all of this we already know. So, the other stuff: she has a history of lupus, and that this just started today. Those are the additional things that the ED resident picked up.
- 35:33Does any of that change
- 35:34your thinking?
- 35:35 I'm worried now more that she's got, like, a constrictive pericarditis, or... but the PE is still number one. Yeah. So you're considering other things, but it hasn't really changed yet.
- 35:50Okay. Let's see. Oops.
- 35:56You would think I know
- 35:57how to use a Mac.
- 35:58It's embarrassing.
- 36:05 So what I love, yeah, turn this off, what I love about this is you can actually see what the model itself is thinking. So the model, again, similar to me: the fact that this started just one day ago means that, like, that's acute. So it's changing its thinking also. Let's see, did it pick up on the lupus here? Yes. Ah, now, this is why I like seeing its thinking, because you can see it's like, okay, a clot in lupus, could it be antiphospholipid antibody syndrome? So still, like you, it thinks that a recurrent or worsening PE is the number one thing on the differential. It's now considering antiphospholipid antibody syndrome on the differential, which I think is reasonable given that additional history of lupus.
- 36:45 Pericardial effusion. So, what you were mentioning: could this be a pericardial effusion? It's actually worried about tamponade. I don't know why, because the blood pressure is normal, but it mentions that. Blood pressure is normal, so no hypotension. Arrhythmia, pericarditis, still considering ACS and pneumothorax. It does put panic attack. And then what it wants, I don't think it's changing what it wants: it wants an EKG, a CTPA, a TTE, chest X-ray, all the standard things.
- 37:10 Does this, did this second opinion change your thinking at all? Really? Does it make you feel more confident or not? Yes. Yeah, it makes me feel, yeah, I guess it makes me feel more confident.
- 37:26It's okay to say nothing.
- 37:28Priest health care is last
- 37:29week.
- 36:31 Well, that's, I mean, we'll get into this, right? So if you were to give AI-generated second opinions, even if very effective, it might just lead to overtreatment of everything. So, a lot of, what? We'll talk about that after. Because this is a big concern about when and in what situation you should do this. Okay. Exam isn't gonna help you any. I'll put it in the system, but this is the ED documented exam. Okay, they document it as completely normal. Okay. This is not one of my, like, literally, this is a random patient that
- 38:01 I picked out. So I have no idea if it was actually normal. I'm just gonna put that in there so it knows. And then we'll move on to the next piece of information, which is the imaging. So the resident orders actually pretty much everything that you asked for. They also order an EKG. It is a problem in my data pipeline that I'm not able to pull in EKGs yet, so there is no EKG here. But X-ray shows bibasilar opacities compatible with small bilateral pleural effusions. CTA: right lower lobar and segmental PE without right heart strain, a small pericardial effusion, and then these bilateral axillary lymph nodes, as well as this hypodensity in the liver. A TTE is performed. Big picture: there's no strain seen on the TTE. No tamponade. But, so, yeah, those images, does that change your thinking at all, or are you pretty much where you were?
- 38:51 It, I mean, it knocks down pneumonia a little bit. It knocks down, like, it sounds like her cardiopulmonary silhouette is normal, so, like, I'm not worried, like, she's got a big pleural effusion or anything like that. Yeah. I think I'm still on the PE train. You're still on the PE train. Okay.
- 39:14 Let me try to not freak out my AI model too much. Oh, why am I scrolling the wrong direction? Embarrassing, Adam. But I want an EKG. I don't have it. This is Epic; this is a snowflake problem. The EKGs are stored in another database, so they're actually not very easy to pull in. This is the problem: not until the cardiologist confirms the read can you extract it. So because my data pipeline is running live, none of our patients have EKGs. So this is what we were talking about: so much of this comes down to, like, understanding where the data comes from and the limitations.
- 39:51 Okay. So let's see how the AI model has changed its thinking. So, like you, it still thinks recurrent or worsening PE is the number one diagnosis. It still is worried about antiphospholipid antibody syndrome. But, sorry, can I just say? Yeah? Please. Antiphospholipid syndrome is not causing her acute, like... I know. So, like, it can say that, and it might be APS, but, like, if it's APS, the APS is causing something. Right. Right. Right. It doesn't, that doesn't help you. Well, it helps you down the line, but not, it doesn't help you in the acute setting. Exactly.
- 40:31 Pericardial effusion, possibly lupus-related. Infection, pericarditis, arrhythmia. Yeah, these are all pretty much things that were on your differential, right? None of this changes your thinking at all. It doesn't. Except to get annoyed, because you're like, even if it's antiphospholipid antibody syndrome, it's still a PE.
- 40:49Okay.
- 40:51Labs,
- 40:52I don't know that this
- 40:53is gonna help you much,
- 40:54but she admit our lactate
- 40:55cutoff is one point six,
- 40:56so this is a slightly
- 40:57elevated lactate. Sodium is one
- 40:59thirty one.
- 41:00These other labs are all
- 41:02relatively normal. A troponin was
- 41:04negative. That's a a negative
- 41:05proBNP.
- 41:06She is anemic. Hemoglobin is
- 41:08nine point one.
- 41:09Her INR is elevated one
- 41:11point seven,
- 41:12with a PTT of thirty
- 41:13five.
- 41:14That diff is normal, and
- 41:16then when they repeated the
- 41:17lactic acid after fluids, it
- 41:18was one point five, which is
- 41:20just below the cutoff. So
- 41:21it's normal in our system,
- 41:22and the repeat troponin was
- 41:23negative. I'm going to well,
- 41:25did the labs change anything
- 41:26for you? No. Yeah. I
- 41:27I wouldn't think they would.
- 41:29And let's see if they
- 41:30change anything for the AI model.
- 41:42Come on. Show me what
- 41:43you're doing.
- 41:47You're running a different
- 41:49ChatGPT.
- 41:50This is very slow compared
- 41:51to my experience.
- 41:53This is a new model
- 41:54called o1. You can
- 41:55see what it's doing. It
- 41:57is using an internalized chain
- 41:58of thought process
- 41:59that, well, that's what it's
- 42:01doing. It's thinking,
- 42:02thinking through different steps. So
- 42:04that's why it's going so
- 42:05slow. In reality, we usually run
- 42:06this with
- 42:07a Llama model, and that's very
- 42:09fast.
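(For readers curious what this looks like in code, here is a minimal sketch of sending the same case to a slow reasoning model and to a fast conventional model with the OpenAI Python client; the model names and prompt are illustrative assumptions, not the demo's actual pipeline.)

```python
# Minimal sketch, not the actual demo pipeline: one case sent to an
# o1-style reasoning model (slow, internal chain of thought) and to a
# conventional chat model (fast).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

case = "Patient with SLE and prior PE presenting with pleuritic chest pain."

# o1-style models work through intermediate reasoning steps before
# answering, which is why the response takes noticeably longer.
slow = client.chat.completions.create(
    model="o1-preview",  # assumed model name, for illustration
    messages=[{"role": "user", "content": f"Give a ranked differential:\n{case}"}],
)

# A conventional chat model returns almost immediately by comparison.
fast = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, for illustration
    messages=[{"role": "user", "content": f"Give a ranked differential:\n{case}"}],
)

print(slow.choices[0].message.content)
print(fast.choices[0].message.content)
```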
- 42:11Okay.
- 42:12So I'm gonna guess it's
- 42:13gonna say the same things,
- 42:14but let's check it out.
- 42:16Recurrent PE,
- 42:18it's not changing. Right?
- 42:19So it's basically saying the
- 42:20same thing.
- 42:23I don't think any of
- 42:23these are different.
- 42:25Okay.
- 42:28So I'll just go over
- 42:29the last bit because our
- 42:31final touch point is when
- 42:31medicine sees the patient, so
- 42:33when medicine actually admits the
- 42:34patient. So I bolded
- 42:36the relevant thing. So, the
- 42:38ED no. Sorry. The medicine
- 42:40intern sees the patient still
- 42:41in the ED, and the
- 42:42patient says,
- 42:44Oh, this happened to me
- 42:45before six years ago. I
- 42:47had a lupus flare with
- 42:48very similar symptoms,
- 42:49and the medicine resident finds
- 42:51out that the patient's
- 42:53outpatient rheumatologist
- 42:54for the last couple weeks
- 42:55has felt that she's having
- 42:56a lupus flare and has
- 42:57been modifying her medications
- 42:59and that these are the
- 43:00current medications. So methotrexate,
- 43:02twenty q seven days, which
- 43:03has just been increased, Plaquenil,
- 43:05folic acid, and then apixaban,
- 43:07which is new.
- 43:08Does any of that
- 43:11change
- 43:13your
- 43:16thinking?
- 43:18It's okay to say no.
- 43:20No. It it it doesn't.
- 43:21It's just you know, I
- 43:22think there's
- 43:24two things going
- 43:27on.
- 43:30Do you want me to
- 43:30tell you what the final
- 43:32Like, I guess my question
- 43:33is, is it pleuritic?
- 43:34Is it pleuritic? Yeah. You
- 43:36wanna know what the final
- 43:36diagnosis was from the the
- 43:37medicine team? Is it is
- 43:39it a lupus flare? It's
- 43:40lupus flare. Yeah. So
- 43:41the final diagnosis is actually
- 43:43that this poor woman
- 43:44has pericarditis and pleuritis from
- 43:46a lupus flare and had
- 43:47a PE secondary to that.
- 43:49So there are two different
- 43:50things going on. And, you
- 43:52know, the ED team did
- 43:54everything completely appropriate. Right? When
- 43:56you have a patient like
- 43:56this come in, obviously, you
- 43:57wanna make sure they're not
- 43:58having something devastating. So it's
- 43:59not even that this patient
- 44:00was mismanaged
- 44:02in any way.
- 44:03It's just not what the
- 44:05initial diagnosis was. So I
- 44:06was more curious.
- 44:07Come on.
- 44:09It's very slow. It's being
- 44:10very finicky. So I'm curious
- 44:12what the AI model is
- 44:13going to end up saying.
- 44:15You can see how slow
- 44:16it is. This is all
- 44:17the different steps.
- 44:20Well
- 44:21A friction rub?
- 44:23I did not see this
- 44:25patient. This is all mediated
- 44:27through the
- 44:28chart, and I will tell
- 44:29you that in our data
- 44:30pipeline, which pulls in only
- 44:31two different documented
- 44:33physical exams, no one documented
- 44:35it, but that doesn't mean she
- 44:36didn't have it.
- 44:40I highly doubt that the
- 44:41ED resident did the exam
- 44:42later. Okay. So let's see
- 44:43what the model says.
- 44:45Okay. So this is it.
- 44:47This is the
- 44:48right final diagnosis, which is
- 44:49lupus flare with serositis.
- 44:51And then number two,
- 44:52the patient has a
- 44:54pulmonary embolism. And, of course,
- 44:55what we found out or
- 44:57I'm not this patient's doctor,
- 44:58but what ended up happening
- 44:59is they get
- 45:00the outside CTA, and it
- 45:01shows that, in fact, the
- 45:03PE is no larger. It's
- 45:04even a little bit smaller.
- 45:05So this is not a
- 45:05recurrent PE. The symptoms were
- 45:07likely driven from pericarditis and
- 45:09pleuritis, so lupus serositis.
- 45:12The patient was worked up
- 45:13for antiphospholipid antibodies from all
- 45:14the tests were negative. So
- 45:16that's the that's the final
- 45:17diagnosis. So reflecting back, like,
- 45:19would this have been helpful
- 45:20if you were getting this
- 45:21and would have driven you
- 45:22in the wrong direction?
- 45:25I don't know if it
- 45:26would have driven me in
- 45:27the wrong direction, but
- 45:30it was confirmatory.
- 45:31But I mean,
- 45:33I think
- 45:35well, two things. One, we're
- 45:36in an ED setting. So,
- 45:38you know, I think the
- 45:39most important thing is ruling
- 45:41out the things that are
- 45:41gonna kill her in the
- 45:42next hour
- 45:44so, you know
- 45:46It might be lupus,
- 45:47but you don't want to
- 45:48miss, right, you don't want
- 45:48to miss a PE or
- 45:49a dissection.
- 45:51I guess I don't know
- 45:51if that's anchoring, but, like,
- 45:53my top things
- 45:54are I wanna make sure
- 45:55she's not tamponading. I wanna
- 45:57make sure she's not having
- 45:58a massive PE. I wanna
- 45:59make sure she's not having,
- 46:00like, a dissection
- 46:02or she's having, like, a
- 46:03big heart attack. So, like,
- 46:05other than that,
- 46:06it didn't really change anything.
- 46:08Because you would have done
- 46:09everything exactly the same. And
- 46:10you were considering those cannot
- 46:12miss diagnoses from the very
- 46:13beginning, obviously.
- 46:14Yeah. Yeah. I don't think
- 46:15it would have changed much.
- 46:17Would it have driven you
- 46:18in the wrong direction? Right?
- 46:18Would getting a second opinion
- 46:20like this have made you
- 46:21second guess yourself or I
- 46:23think it would have made
- 46:24me order more tests.
- 46:26And in in particular, like,
- 46:28in the ED, you would
- 46:28have ordered a bunch of
- 46:29those tests?
- 46:31Maybe not in the ED,
- 46:32but, like, it was I
- 46:33saw it. It was, like,
- 46:34get a cardiac MRI.
- 46:36I hope no one well,
- 46:37it wouldn't matter. Cardiology wouldn't
- 46:38do the cardiac MRI, but,
- 46:39yes, that is not an
- 46:41appropriate test for this workup.
- 46:42Yeah. But,
- 46:43yeah, maybe order more I
- 46:45I might have ordered more
- 46:46labs if I'm being honest.
- 46:47Yeah.
- 46:48Alright. Well, that's thank you
- 46:50very much. I'll give you
- 46:51a a hand. I was
- 46:52very
- 46:53sorry that I made you
- 46:54do general medicine again.
- 46:56You thought you escaped.
- 46:59No. That that's oops. Well,
- 47:01that's cool that you have
- 47:02a, one of these things
- 47:03here. So that's, I mean,
- 47:04this is an example
- 47:06of what it looks like
- 47:08in practice in a randomly
- 47:09selected case. And you can
- 47:10start to already see when
- 47:11you go through it some
- 47:12of the challenges of implementing
- 47:14a system like this. I'm
- 47:15gonna go over some of
- 47:16the data, including some of
- 47:17the new data before, seeing
- 47:18if anybody has any questions
- 47:19and before I lose my
- 47:20voice.
- 47:22So LLMs encode lots of
- 47:23knowledge.
- 47:24I'm sure that everyone saw
- 47:25that, you know, it can
- 47:26pass the USMLE.
- 47:27I don't care about this.
- 47:29You guys should not care
- 47:30about this either. It turns
- 47:31out, this is actually from
- 47:32some of our interesting ablation
- 47:33studies, that LLMs'
- 47:35performance on exams
- 47:37has less to do with medical knowledge;
- 47:38what they're doing is that
- 47:40they have learned the semantic
- 47:41structure of multiple choice questions,
- 47:43meaning that they are effectively
- 47:45good test takers.
- 47:46Some of my colleagues did
- 47:47a really cool experiment where
- 47:48they made up two new
- 47:49organ systems, and then they
- 47:50had test writers write up
- 47:51multiple choice questions with those
- 47:52fake organ systems. And the
- 47:54LLMs still did really well
- 47:55on it because they learned
- 47:56to understand what a question
- 47:58looks like and guess the
- 47:59right answer from that. And
- 48:00I think everyone here knows
- 48:01that if you're being honest
- 48:02with yourself about how multiple
- 48:03choice works, you start by
- 48:05excluding a couple things. We
- 48:06all know how that works.
- 48:07So none of that matters.
- 48:09This empathy thing, I think,
- 48:10is overplayed also. You should
- 48:12know that
- 48:13this is the justification for
- 48:15having LLMs write portal messages,
- 48:16that patients find their communication
- 48:18more empathetic.
- 48:19The standard, of course, is
- 48:21compared to a very overstretched
- 48:23PCP who's just trying to
- 48:25communicate your CBC results. So
- 48:26these are not actually in-person
- 48:29communications, but at least in
- 48:30written communications, people do find
- 48:32the LLM to be more
- 48:33empathetic.
- 48:34For what I care about,
- 48:36LLMs are able to make
- 48:37diagnoses
- 48:38on a lot of the
- 48:40benchmarks
- 48:41that, like, the field has
- 48:42accepted. LLMs have long since
- 48:44surpassed humans,
- 48:46but a lot of these
- 48:47are relatively artificial because they're
- 48:48very information dense settings, very
- 48:50complicated diagnoses,
- 48:52and a lot of when
- 48:53you look at the diagnostic
- 48:54errors are not coming from
- 48:56lupus nephritis. They're coming from
- 48:57people misdiagnosing
- 48:59common things.
- 49:01They have this is fascinating
- 49:02they have an emergent probabilistic
- 49:04reasoning,
- 49:05so there's no reason that
- 49:07semantic, like, language should give
- 49:09you a probabilistic understanding of
- 49:11disease states. But in fact,
- 49:12we studied this with Dan
- 49:13Morgan. When you compare them
- 49:14to large groups of humans,
- 49:16they have a better sense
- 49:17of understanding the pretest probability
- 49:19of disease and how that
- 49:20changes with subsequent tests. That
- 49:22holds up
- 49:23pretty well. The post-test
- 49:25probability of disease is not
- 49:27really any better than humans
- 49:28after a positive
- 49:30test, but after a negative
- 49:30test, it's a lot better
- 49:31than us.
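(To make that pre-test/post-test arithmetic concrete, here is the standard likelihood-ratio update; the numbers are illustrative, not from the study.)

```latex
% Post-test probability via likelihood ratios (illustrative numbers).
% Pre-test probability 10% -> pre-test odds = 0.10/0.90 \approx 0.111.
% A negative test with LR- = 0.2:
%   post-test odds = 0.111 \times 0.2 \approx 0.022
%   post-test probability = 0.022/1.022 \approx 2.2\%
\[
\text{post-test odds} = \text{pre-test odds} \times LR,
\qquad
p = \frac{\text{odds}}{1 + \text{odds}}
\]
```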
- 49:33They can forecast similarly well.
- 49:34So if you ask it,
- 49:36what do you think the
- 49:36percentage chance of the final
- 49:38diagnosis is? This was
- 49:39done with neurologists, ID doctors,
- 49:41and pediatricians.
- 49:42It outperforms every single individual
- 49:44human in every single group
- 49:46of humans,
- 49:47only being beaten when you
- 49:48take the best groups and
- 49:49put them together.
- 49:52They are able to display
- 49:54reasoning.
- 49:55So when it comes to,
- 49:56like, how will you communicate
- 49:58with a human,
- 49:59there's this whole question of
- 50:00human computer interaction. What would
- 50:02an AI... When you give
- 50:06LLMs actual cases, as new
- 50:09information comes in, and ask
- 50:10them to update their thinking,
- 50:11and you compare that to
- 50:12humans, they outperform humans consistently.
- 50:16It outperforms
- 50:17attendings who outperform residents, and
- 50:18there's no difference in efficiency,
- 50:20accuracy, quality, or cannot-miss diagnoses.
- 50:22It does hallucinate more. So
- 50:24this is actually a pretty high
- 50:25hallucination rate. Right? It makes
- 50:26up stuff twelve percent of
- 50:27the time compared to only
- 50:28three percent of the time
- 50:29with humans. Now the hallucinations
- 50:31are relatively minor. Some of
- 50:32them are kind of funny.
- 50:33One of them was a
- 50:34patient who had diverticulitis,
- 50:36and the LLM wanted
- 50:37the human to keep gastroenteritis
- 50:39in mind because the patient
- 50:40had recently traveled to, Texas,
- 50:42and going to Texas was
- 50:43a risk factor for enterotoxigenic
- 50:45E. Coli, which I'm pretty
- 50:47sure is not true. So
- 50:48that is a hallucination,
- 50:49but probably not one that
- 50:51would harm the patient.
- 50:52This study was done by
- 50:53my colleagues at Google,
- 50:55very controversial when it came
- 50:56out because what they did
- 50:58is it's actually not a
- 50:59very high performing model. It's
- 51:00a PaLM 2 model, but
- 51:01the model itself could solve
- 51:02CPCs. You can see humans
- 51:03are on the bottom. It
- 51:04could solve clinical pathological conferences,
- 51:06and they randomized humans to
- 51:09either solve conferences themselves
- 51:11using Google search, using the
- 51:12AI model, or the AI
- 51:13model itself. And this was
- 51:15controversial when it came out
- 51:16because when you gave humans
- 51:17the AI model, it actually
- 51:18made the model not perform
- 51:19as well. So adding humans
- 51:21into the mix lowered performance.
- 51:23This is against
- 51:24the kind of standard precepts
- 51:26of the informatics field, so
- 51:27quite controversial when it came
- 51:28out. Unfortunately,
- 51:30my group
- 51:31ran a large randomized controlled
- 51:33trial looking at very nuanced
- 51:34measures of reasoning in real
- 51:35cases, so not CPCs,
- 51:38and we found the same
- 51:38thing. The human performance is
- 51:41on the right by itself
- 51:42in blue.
- 51:43Humans using the AI are
- 51:44in green, and the AI
- 51:45model by itself is in
- 51:46red. So we found the
- 51:47same thing, and because we
- 51:49did a very nuanced measure,
- 51:50I can tell you why.
- 51:51And it's humans, when the
- 51:53AI model tells them that
- 51:54they're wrong, disregard those pieces
- 51:55of information. In particular,
- 51:57humans don't like an
- 52:01AI model critiquing
- 52:02or disconfirming the things that
- 52:04they think.
- 52:06Another randomized controlled trial that
- 52:08was just accepted, this is
- 52:09the one I just
- 52:09heard about from Nature.
- 52:10So this is in management
- 52:12decisions. Management decisions are notoriously
- 52:14tricky to measure.
- 52:15In this one, LLMs
- 52:17did improve people's ability when
- 52:18they used them to make
- 52:19management decisions, when you randomized.
- 52:21But when we look at
- 52:21the subgroup, it's not what
- 52:23you would think. Like, people
- 52:24aren't using the LLM to
- 52:25say, what is the right
- 52:26dose of apixaban, or even
- 52:27should I give apixaban? What
- 52:28the LLM did was cue
- 52:30them to, for example, apologize
- 52:32after making a medical error
- 52:33or communicate better with other
- 52:35providers or take patient factors
- 52:37into account when following a
- 52:38likely cancerous nodule. So it
- 52:40actually improved performance
- 52:41not in, like, what we
- 52:42think of as the standard
- 52:43management
- 52:45domains, but in things that
- 52:46we think humans are
- 52:47good at.
- 52:49A lot of the work
- 52:49that I'm doing with Google
- 52:50is on building models that
- 52:51can collect data. This is
- 52:53from the AMIE system.
- 52:55This is standardized patients, not
- 52:57real patients, but a true
- 52:58Turing test where
- 53:00standardized patients talk to
- 53:01a terminal, and they don't
- 53:02know whether it's a human
- 53:03or an AI on the
- 53:04other side. And,
- 53:06on twenty six of twenty
- 53:07six patient domains, patients preferred
- 53:09the AI, and on twenty
- 53:11eight of thirty two axes
- 53:12from the physician graders,
- 53:14AI was preferred,
- 53:15and this held up in
- 53:16every single diagnostic category. So
- 53:19we're running this in clinical
- 53:20trials in
- 53:21actual patients now. It's still
- 53:22performing quite well, and models are
- 53:24increasingly able to collect data.
- 53:27Now
- 53:28the unpublished data
- 53:30that I'm about to show
- 53:31you is from my grad
- 53:31student that I put in
- 53:32this presentation this morning because
- 53:34the models continue to improve.
- 53:36And if you had asked
- 53:36me six months ago, I
- 53:37would have said we're seeing convergence
- 53:39of model performance, that
- 53:41there were just gonna be
- 53:43incremental improvements.
- 53:44But this
- 53:46is for
- 53:47solving CPCs. So this is
- 53:48one of these benchmarks that
- 53:49goes back almost sixty years.
- 53:51And the new models have
- 53:52surpassed everything that came before,
- 53:54and you can see humans
- 53:55are in brown in the
- 53:55bottom. This is the one
- 53:56that freaks me out because
- 53:58this is not an HCI
- 53:59study, right? This is just
- 54:00looking at the human baseline,
- 54:01but I showed you these
- 54:02are real cases for the
- 54:03diagnostic and management decisions.
- 54:05The colors are different, but
- 54:06these are the old graphs,
- 54:07and this is the new
- 54:08model on the left. And
- 54:09you can see that the
- 54:10new models are performing
- 54:13far better than any
- 54:15other system, not only much
- 54:16better than the humans, but
- 54:17better than the previous AI
- 54:18systems. So we're continuing to
- 54:20see performance gains.
- 54:22Eric Horvitz and the Microsoft group
- 54:24published their Medprompt follow-up
- 54:26on o1 today, and
- 54:27they came to
- 54:28the same conclusion as my
- 54:29paper, which is, like, these
- 54:30things have gotten so good
- 54:31that we need new benchmarks.
- 54:34Or clinical trials, because
- 54:35they are outperforming everything that
- 54:37we throw at them.
- 54:39I can go over these
- 54:40quickly. In reality, so a
- 54:42lot of tools are now
- 54:43being used in clinical practice.
- 54:45They're actually kind of underperforming
- 54:46from what we were sold.
- 54:47So if you look at
- 54:48some of the early performance
- 54:49of AI,
- 54:51scribes, which I know you
- 54:52guys are using here at
- 54:53Yale and some of the
- 54:54clinics, some of the early
- 54:55studies actually suggest there is
- 54:56no efficiency gain because they
- 54:58hallucinate, and the doctors have
- 54:59to go back and check
- 55:01the models.
- 55:02And, yeah, people like it,
- 55:03but it's not really saving
- 55:05anybody time and, you know,
- 55:07people always care about money.
- 55:08It's not saving anybody any
- 55:09money either.
- 55:10The same thing is happening
- 55:12with the
- 55:13patient portal messaging. One of
- 55:15the very depressing things from
- 55:16the JAMA study on this
- 55:17is that it actually took
- 55:18more time, seven percent more
- 55:20time, physician time, when the
- 55:22AI wrote the initial drafts
- 55:23because it hallucinates or says
- 55:25something harmful, and the doctor
- 55:26has to go back and
- 55:27edit it. Again, the patients
- 55:28liked the responses more, but
- 55:29it took the doctors more
- 55:30time.
- 55:31And then what everybody should
- 55:33know, I don't think I
- 55:34need to say this, but
- 55:34LLMs are racist and sexist.
- 55:36They actually encode
- 55:38because they are trained on
- 55:39our language and then fine
- 55:40tuned by humans. They encode
- 55:41all of the biases that
- 55:43humans have. Now they do
- 55:45appear I just published
- 55:46a study in, JAMA. They
- 55:47appear to be less racist
- 55:49and sexist than us, but
- 55:50they are still racist and
- 55:52sexist.
- 55:52So in a world where
- 55:54we're trying to get past,
- 55:55like, race based medicine,
- 55:57especially as LLMs get more
- 55:58and more powerful, we should
- 55:59know that they are showing
- 56:01human, not only cognitive biases,
- 56:03but racial and gender biases,
- 56:04which is concerning.
- 56:06And then we talked a
- 56:08little bit, but the you
- 56:08know, HCI is actually quite
- 56:10challenging because
- 56:12if used inappropriately, this technology
- 56:14probably will drive overtreatment.
- 56:17Different people need different opinions
- 56:19at different times. Like, a
- 56:20second opinion is not universally
- 56:21helpful.
- 56:22Also, HCI is unpredictable. There's
- 56:24a great study from some
- 56:24of my colleagues at MIT
- 56:26that showed that the best
- 56:27radiologists
- 56:28actually have their performance lowered
- 56:30by a high performing AI
- 56:31because they second guess themselves.
- 56:33So just because an AI
- 56:34model works well, even if
- 56:36it consistently works well, like,
- 56:38in silico,
- 56:40doesn't mean that it's actually
- 56:41going to improve human performance
- 56:42because, you know, again, we
- 56:44are hairless apes that evolved
- 56:45to, like,
- 56:47be hunter-gatherers, and now
- 56:48we're trying to do complex
- 56:49medicine in the twenty first
- 56:50century. So,
- 56:52whew, I'm gonna lose my
- 56:53voice. That is it for
- 56:54this presentation. So if anybody
- 56:56has any questions or wants
- 56:57to talk about pathology, I
- 56:58am happy to, entertain them.
- 57:05And thank you very much.
- 57:08Are the new models based
- 57:10on the performance of the
- 57:11previous models, formed from beta
- 57:13testing? So the new model,
- 57:15o1, this is really interesting,
- 57:16has no new data in
- 57:18it. It is
- 57:19the same
- 57:20data as GPT-4 Turbo. So
- 57:21the cutoff is like last
- 57:23year. So what is
- 57:24improving its performance has nothing
- 57:25to do with the training
- 57:26data; what they're doing
- 57:27is chain of thought. So
- 57:29if you get a model
- 57:29to speak its thinking out loud,
- 57:30it does better. And what
- 57:31they've done is reinforcement learning
- 57:33for the chain of thought.
- 57:34So they're teaching it how
- 57:35to think out loud and
- 57:36then reinforcing that over time.
- 57:38So these models, these are
- 57:39all computational techniques. It has
- 57:40nothing to do with the
- 57:41underlying data, and there's no
- 57:43more scale. The parameters of
- 57:44the model are exactly the
- 57:45same, which is one of
- 57:46the reasons I'm so freaked
- 57:47out because I didn't think
- 57:47we could get such impressive
- 57:49performance gains without increasing the
- 57:51number of parameters.
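(As a toy sketch of that idea, and emphatically not OpenAI's actual training procedure: sample chains of thought, reward the ones that reach a correct answer, and shift the sampling distribution toward the rewarded reasoning styles. Everything below is invented for illustration.)

```python
# Toy illustration of reinforcement learning over chains of thought.
import random

# Hypothetical "policy": a distribution over reasoning styles.
policy = {"rule_out_worst_first": 0.34, "pattern_match": 0.33, "bayes_update": 0.33}

# Pretend some reasoning styles reach the right answer more often.
SUCCESS_RATE = {"rule_out_worst_first": 0.5, "pattern_match": 0.3, "bayes_update": 0.7}

def sample_trace(policy):
    """Sample one chain of thought and score its final answer."""
    styles, weights = zip(*policy.items())
    style = random.choices(styles, weights=weights)[0]
    reward = 1.0 if random.random() < SUCCESS_RATE[style] else 0.0
    return style, reward

def reinforce(policy, style, reward, lr=0.05):
    """Upweight rewarded styles, then renormalize to keep a distribution."""
    policy[style] += lr * reward
    total = sum(policy.values())
    for k in policy:
        policy[k] /= total

for _ in range(2000):
    style, reward = sample_trace(policy)
    reinforce(policy, style, reward)

print(policy)  # probability mass drifts toward the higher-reward style
```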
- 57:54Yes.
- 58:12Yeah. Yeah. So this is
- 58:13a great question. Like,
- 58:15what would it look like
- 58:16in practice?
- 58:17So
- 58:18I'm assuming
- 58:19we're talking Epic here. Right?
- 58:21So the the reality of
- 58:23the situation is that Epic
- 58:24is working on clinical decision
- 58:26support software.
- 58:28This is not a huge
- 58:29priority. If you look at
- 58:30what Epic is working on,
- 58:31they're mostly efficiency. They're working
- 58:33on, like, text summarization.
- 58:34However,
- 58:35Epic does make it fairly
- 58:37easy to have a data
- 58:38pipeline to put information in.
- 58:40So even at my own
- 58:41institution, we have a pipeline
- 58:41through Amazon Web Services that
- 58:41I can push a second
- 58:42opinion in through the chart,
- 58:44trivially easy. Like,
- 58:45any health system could
- 58:46do this. Any third party,
- 58:47there are vendors right now
- 58:48who want to sell you
- 58:49this technology. No one should
- 58:50buy it, by the way,
- 58:52because this is not tested,
- 58:56and I'm pretty certain that
- 58:57it will lead to worse
- 58:58care if used, like, routinely
- 59:01on every single patient. So
- 59:02from a technological standpoint, you'd
- 59:04need, like,
- 59:06fifteen hours of a programmer's
- 59:07time to build a pipeline
- 59:08to do this. The question
- 59:10becomes, like, what are the
- 59:11other strategies that you're going
- 59:12to do to make sure
- 59:13that you're giving a second
- 59:14opinion to the right person
- 59:15at the right time? At
- 59:16the BI and what we're
- 59:17doing through the, Home Run
- 59:19Network is we're looking at
- 59:20serving second opinions at clinical
- 59:22decompensation. So at the moment
- 59:23that a patient's about to
- 59:24go to the ICU, based
- 59:26on this logic that the
- 59:27patient is already really sick,
- 59:30we do diagnostic timeouts anyway.
- 59:32So this is just another
- 59:33part of the diagnostic time
- 59:34out. But, so a lot
- 59:35of our work is like
- 59:36looking at audit logs, trying
- 59:37to get a sense on
- 59:38which patients or which providers
- 59:39need second opinions, and that's
- 59:40much more computationally intense.
- 59:43My guess is, like, in
- 59:44five years, Epic will just
- 59:45build this into Epic.
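(As a hedged sketch of what such a push could look like: one common integration pattern is POSTing a FHIR R4 DocumentReference to the EHR's FHIR endpoint. The URL, token, patient ID, and note text below are hypothetical; this is not the BI's actual AWS pipeline.)

```python
# Illustrative sketch: pushing an AI "second opinion" note into the chart
# as a FHIR R4 DocumentReference. All identifiers here are hypothetical.
import base64
import requests

FHIR_BASE = "https://ehr.example.org/fhir/r4"  # hypothetical endpoint
TOKEN = "..."  # OAuth bearer token from your EHR integration

note = "AI second opinion: consider lupus serositis in addition to PE."
doc = {
    "resourceType": "DocumentReference",
    "status": "current",
    "type": {"coding": [{"system": "http://loinc.org", "code": "11488-4",
                         "display": "Consult note"}]},
    "subject": {"reference": "Patient/123"},  # hypothetical patient ID
    "content": [{"attachment": {
        "contentType": "text/plain",
        "data": base64.b64encode(note.encode()).decode(),
    }}],
}

resp = requests.post(
    f"{FHIR_BASE}/DocumentReference",
    json=doc,
    headers={"Authorization": f"Bearer {TOKEN}",
             "Content-Type": "application/fhir+json"},
)
resp.raise_for_status()
```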
- 59:48Yes.
- 59:50So I mean, the question
- 59:51is about the state
- 59:53of,
- 59:54let's say, current models.
- 59:56Are they capable of creating new
- 59:58information? And if not,
- 01:00:00what does that mean for how we study them?
- 01:00:03You are asking the right
- 01:00:04questions. So okay. What I'm
- 01:00:05gonna say is controversial.
- 01:00:08LLMs,
- 01:00:10large language models, codify human
- 01:00:11knowledge.
- 01:00:12They are, and there are actually
- 01:00:14computational tests to
- 01:00:15test their ability to be
- 01:00:16creative for things outside of
- 01:00:17their training set. I do
- 01:00:19not think there is any
- 01:00:20reason to think that any
- 01:00:21large language model will ever
- 01:00:23be able to be creative
- 01:00:25in that way, outside of its
- 01:00:26training set. They are effectively
- 01:00:29locking in human knowledge. Now
- 01:00:30they can be updated, and
- 01:00:31they can read things and
- 01:00:33integrate that new knowledge, but
- 01:00:34they're still fundamentally limited by
- 01:00:36what's in their training set.
- 01:00:37And that gets to, like,
- 01:00:38well, what are the impacts
- 01:00:39for medicine? The fact of
- 01:00:40the matter is when it
- 01:00:40comes to diagnosis,
- 01:00:42ninety-nine, ninety-eight percent of the time,
- 01:00:44we're not being creative, but
- 01:00:45sometimes that's necessary. And what
- 01:00:47does this do for human
- 01:00:48creativity? I mean, everyone's seen
- 01:00:49this. When you work with
- 01:00:50an LLM, you have it
- 01:00:51write something. It's very average.
- 01:00:53It's very milquetoast.
- 01:00:54It literally is picking out
- 01:00:56the average of its training
- 01:00:57set. That's actually one of
- 01:00:58the reasons it works well
- 01:00:59in diagnosis, but there's gonna
- 01:01:00be downstream effects and the
- 01:01:02lack of creativity is one
- 01:01:03of them.
- 01:01:07Well, you mean in science
- 01:01:08or in
- 01:01:10any field that we
- 01:01:11touch. I mean, this is
- 01:01:12a very real
- 01:01:13concern.
- 01:01:15Yeah. You're not wrong.
- 01:01:17I
- 01:01:18I think this is is
- 01:01:20that is that depressing? I'm
- 01:01:21sorry. You're maybe looking for
- 01:01:22a more optimistic answer there.
- 01:01:24LLMs
- 01:01:25will, I don't think, ever
- 01:01:26be capable of creativity in
- 01:01:28the way that a human
- 01:01:29is.
- 01:01:32Oh, I have all the
- 01:01:32time in the world.
- 01:01:34That's not true. But
- 01:01:36What about, specifically, the physical
- 01:01:39exam? You know, how can
- 01:01:40we be doing a diagnostic
- 01:01:42differential
- 01:01:43based on the exam?
- 01:01:45And this is a
- 01:01:46bigger question about
- 01:01:51the skills that we need,
- 01:01:53or
- 01:01:55errors. And there's also implications
- 01:01:57for what we should be
- 01:01:57teaching our students.
- 01:01:59A physical exam, in terms
- 01:02:01of collecting data is not
- 01:02:02something that an LLM can
- 01:02:03do now.
- 01:02:05Multimodal models, and this is why
- 01:02:06I'm telling you, pathology: everyone
- 01:02:08who confidently predicts that pathology
- 01:02:09is going to be computerized,
- 01:02:11that technology is a ways
- 01:02:12away. Multimodal models do not
- 01:02:14perform very well.
- 01:02:16And we're
- 01:02:17maybe five to ten years
- 01:02:18from multimodal models being able
- 01:02:20to perform at a human
- 01:02:20level. So in the
- 01:02:22interim, an accurate physical exam
- 01:02:24becomes incredibly important. And you
- 01:02:27saw the exam that
- 01:02:28the resident put
- 01:02:28in there. That's a templated
- 01:02:29exam.
- 01:02:31I mean, probably the resident
- 01:02:32did an appropriate exam for
- 01:02:34someone who they thought might
- 01:02:35have a dissection or PE,
- 01:02:36but she documented just the
- 01:02:38templated exam and that can
- 01:02:40throw off a language model.
- 01:02:41So when it gets to,
- 01:02:42like, what are the things
- 01:02:43that humans are uniquely good at
- 01:02:44in a world where
- 01:02:45we're working more and more
- 01:02:46with AI models,
- 01:02:47good observational skills and learning
- 01:02:50how to do those skills
- 01:02:51and then
- 01:02:52accurately
- 01:02:53represent them is really important.
- 01:02:55I think we'll call it
- 01:02:56a day. We'll get Adam
- 01:02:58a chance to do a
- 01:02:58Phew. Yeah.