
Pathology Grand Rounds, November 7, 2024

November 08, 2024

Adam Rodman, MD, MPH, FACP, of Harvard Medical School and Beth Israel Deaconess Medical Center, presents on, "Towards an AI Second Opinion: Clinical Reasoning Large Language Models and How We Might Make Humans a Little Better."


Transcript

  • 00:00Hi, everybody. Welcome to Pathology Grand Rounds.
  • 00:04I'm Rob Homer.
  • 00:06I am exceptionally pleased to introduce our speaker, Adam Rodman, who has a dual degree in medicine and public health from Tulane, trained in internal medicine at Oregon Health and Science University, and was a global health fellow at Beth Israel Deaconess, where he wound up working in Botswana for a couple of years and developing a curriculum for interns there, which is great.
  • 00:31He's currently assistant professor of
  • 00:33medicine at Harvard Medical School
  • 00:34and attending internist at Beth
  • 00:36Israel Deaconess.
  • 00:38So I became familiar with Adam from his well-known podcast, Bedside Rounds, which is phenomenal because it goes into the history of medicine. And I have to say that if you wanna understand, like, every time you look around and ask, why are things like this? It turns out there frequently is a reason, and it's frequently found in history. And it was like, oh, that's why we do this. So it really provides a lot of insight, and I strongly recommend it. Also, I have to say, at dinner last night, I found Adam to be pure enthusiasm. If we could just harness that, we don't need nuclear. We don't need solar. We just need Adam. It's phenomenal.
  • 01:14So he transformed from history to AI. How did that happen? Because there's a history of machine learning in medicine, which goes back a long way. When I decided to go to medical school, when I was in college, one of my classmates said, why are you doing this? Computers are gonna take this over any day now. And that was in nineteen seventy nine. So maybe it's true; it just took a little while.
  • 01:39His research, funded by the Macy Foundation, the Gordon and Betty Moore Foundation, and the National Academy of Medicine, explores the integration of large language models into complex diagnostic challenges, which you're gonna hear about today. He's an associate editor for the New England Journal of Medicine AI, with published work in JAMA Network Open, JAMA Internal Medicine, and the New England Journal of Medicine, among other journals.
  • 01:59And, actually, you know, the rationale for him really being here: he's a pioneer of digital innovation in education and medicine. He codirects the iMED program, which is dedicated to exploring digital education. Among other things, he has a digital education track for future educators to learn about electronic education technology. He currently leads a task force to integrate AI into Harvard Medical School undergraduate medical education. He has received numerous teaching and mentorship awards, including most recently the Herrman Blumgart faculty teaching award.
  • 02:32Join me in welcoming Adam.
  • 02:34I'm really excited to hear
  • 02:35this. Thank you. This is
  • 02:36gonna be really cool.
  • 02:37Hi, everybody. I'm Adam.
  • 02:41Okay. Can you guys hear
  • 02:42me? Oh, good. I'm a
  • 02:43pacer. Before I start
  • 02:45oh, I'm so sorry. If
  • 02:46you're an internal medicine physician,
  • 02:48could you raise your hand?
  • 02:51I had a feeling I was gonna pick on you guys. Actually, at about twenty minutes in, I'm going to demonstrate our actual clinical workflows. If one of the internists would be willing, it's not a competition, but to actually use the AI on one of the real patients from our studies, would you be willing to?
  • 03:11Yeah. Okay.
  • 03:12So hi. My name is
  • 03:13Adam. I'm not a
  • 03:15pathologist,
  • 03:16so I apologize in advance.
  • 03:18I am I don't know.
  • 03:19I don't know who's nerdier,
  • 03:20but in general, internists are
  • 03:21some of the nerdiest physicians
  • 03:22alive.
  • 03:23I got a nod over
  • 03:24there. So
  • 03:26I research how people think.
  • 03:28And in pathology
  • 03:29and when it comes to
  • 03:30machine learning in particular, there's
  • 03:32really been a focus
  • 03:33unfairly, I think, on, like,
  • 03:35classification models. Right? Just what
  • 03:37is this image? Which I
  • 03:38think, well, not I think.
  • 03:40I know underplays the actual
  • 03:41cognitive processes that go into
  • 03:43pathology, which is why everyone
  • 03:45still here has a job
  • 03:46and is in no danger
  • 03:47of losing a job anytime
  • 03:48soon. Now what I'm going
  • 03:50to talk about today is
  • 03:51my research, but I'm gonna
  • 03:53try to make it entertaining
  • 03:54to talk about
  • 03:56language models,
  • 03:57reasoning, multimodal models, and what the data show. I'll just say, you know, I was talking to my PhD student this morning. I have updated this slide with data that has not been published yet, that was run yesterday. So this is literally the latest breaking data here. But
  • 04:13to give you an idea
  • 04:13of where we've come already
  • 04:15and where we're gonna
  • 04:16go. So
  • 04:17I wanna start with this
  • 04:18idea that is relevant to
  • 04:20basically every medical specialty, and
  • 04:21it's what is diagnosis? What
  • 04:23does it mean to make
  • 04:25a diagnosis? This is what
  • 04:27we all do. This is
  • 04:27theoretically what internists do, though
  • 04:29often we don't get a definitive diagnosis. This is what we do in pathology. Basically, all of medicine, well, not all of medicine, but much of it, is focused on this idea of diagnosis. Even when
  • 04:40we're talking downstream about
  • 04:42management, it
  • 04:44usually
  • 04:45relies on the diagnostic process.
  • 04:47But what does it mean
  • 04:48to make a diagnosis? So
  • 04:50there's a great paper. I
  • 04:51always say it was just
  • 04:52published. It was actually published
  • 04:52about fifteen years ago. I'm
  • 04:53just getting old and unstuck
  • 04:55in time, but there's a
  • 04:56great paper that looks at,
  • 04:59like, what do we mean
  • 04:59when we say diagnosis? What
  • 05:00do we mean when we
  • 05:01say clinical reasoning? And unfortunately,
  • 05:04as I'm sure everybody here
  • 05:06knows, there's not a standard
  • 05:08definition and everybody is talking
  • 05:09about something a little bit
  • 05:11different, which means we have
  • 05:12a tendency to talk past
  • 05:13each other. So to give
  • 05:15you a sense on where
  • 05:16the science stands, it's complicated.
  • 05:18Yeah. I'm gonna do some
  • 05:19Simpsons stuff in here. So
  • 05:20I realized that as a
  • 05:21geriatric millennial that that dates
  • 05:23me quite considerably.
  • 05:24But at its most basic, dating back to the ancients, diagnosis is a classification
  • 05:30task. Like, what is nosology?
  • 05:31Nosology is the way that
  • 05:32we categorize
  • 05:33diseases.
  • 05:34And fundamentally, making a diagnosis
  • 05:36means, okay. I have to
  • 05:38come up with a single
  • 05:39disease or multiple diseases in
  • 05:41my differential out of this
  • 05:42large classification
  • 05:44schema. Well, that sounds easy.
  • 05:45I, I think what's funny
  • 05:47is if you go back
  • 05:48to some of the early
  • 05:49literature, like this is from
  • 05:50the fifties on diagnosis, they
  • 05:51literally thought it was this
  • 05:52easy. Everybody here knows it's
  • 05:54much more complicated than that.
  • 05:57So modern researchers into the
  • 05:58nature of diagnosis have focused
  • 06:01more on human psychology.
  • 06:03So, RIP Danny Kahneman,
  • 06:05if anybody has read Thinking
  • 06:07Fast and Slow lately, I'm sorry. I'm gonna very
  • 06:10quickly summarize it. I will
  • 06:11also say I actually recently
  • 06:12reread Thinking Fast and Slow
  • 06:14just, like, two months ago.
  • 06:15I don't recommend
  • 06:16reading it. Danny Kahneman, brilliant
  • 06:18person,
  • 06:18very droll
  • 06:20writer.
  • 06:21Again, RIP Danny Kahneman. So
  • 06:23our understanding of modern diagnosis
  • 06:25is, you know, based on
  • 06:26cognitive psychology.
  • 06:28So this idea
  • 06:30that
  • 06:31most of the process is
  • 06:32this very fast, very contextual
  • 06:35automatic system one process. So
  • 06:37system one, very classically, a great example is you're
  • 06:40driving on the highway going
  • 06:41home. All of a sudden
  • 06:43you stop paying attention. Fifteen
  • 06:44minutes go by. You're still
  • 06:45on the highway. Everything's going
  • 06:47well. What happened? Your brain
  • 06:49went into these very automatic
  • 06:51thought processes, and that's our classic system one. We talk about heuristics or mental shortcuts. Often system one gets maligned, right? We spend
  • 06:59a lot of time talking
  • 07:00about how system one can
  • 07:01lead us wrong. But the
  • 07:02fact of the matter is
  • 07:03system one exists. We evolved
  • 07:05this way for a reason.
  • 07:06It's fast, it's efficient, and
  • 07:07it works pretty well.
  • 07:09System two are the very
  • 07:10slow, very contextual,
  • 07:13very labor intensive, and theoretically
  • 07:16less biased, not in reality,
  • 07:17but theoretically less biased kind
  • 07:19of formal thought processes.
  • 07:21So what does this look
  • 07:22like in process? Don't worry.
  • 07:23I'm not gonna go over
  • 07:24this. This is from Pat Croskerry, who I've talked about a bit over the last day. So this is Pat Croskerry's grand
  • 07:30unified
  • 07:31theory of clinical reasoning. Again,
  • 07:33don't worry about it, but
  • 07:34the gist is
  • 07:35we cross over between system
  • 07:36one and two. What does
  • 07:37this look like in practice?
  • 07:39So I think probably everybody
  • 07:40has read or is aware
  • 07:41of Judith Bowen's very famous
  • 07:43paper from the early two
  • 07:44thousands, which is based on
  • 07:45the work of Bordage about
  • 07:46a decade before. But our
  • 07:48current understanding,
  • 07:50cognitive psychological understanding of the
  • 07:52diagnostic process is something called
  • 07:54script theory. And this is
  • 07:55a knowledge encoding and knowledge
  • 07:57activation theory.
  • 07:59So
  • 08:00script theory, how does it
  • 08:01work? The idea is that
  • 08:02we know things about our
  • 08:03patients. We know things about
  • 08:05pathology from a variety
  • 08:07of sources. Medical school, of
  • 08:09course, reading journal articles, and
  • 08:10more importantly, from seeing patients
  • 08:12and from learning, from our
  • 08:13practice, we get feedback. And
  • 08:15all that knowledge gets encoded
  • 08:17in our brains, but we
  • 08:18are not a library. We
  • 08:19are not the Dewey decimal
  • 08:21system. I do not say,
  • 08:23heart failure with reduced ejection fraction
  • 08:25is coded at e twenty
  • 08:27three dot o five. That's
  • 08:28not the Dewey decimal system.
  • 08:29That's like another classification system
  • 08:31that I forget. I spend
  • 08:32too much time in academic
  • 08:33libraries. But, no, how does
  • 08:34that information get encoded? And
  • 08:36it gets encoded
  • 08:38in these things that in
  • 08:39medical education and the psychology
  • 08:41world we call scripts. Scripts
  • 08:43are a psychological principle actually
  • 08:44from the nineteen seventies. It
  • 08:46comes from this idea of,
  • 08:48there are stereotyped
  • 08:49patterns of behavior. So for
  • 08:50example, you walk into a
  • 08:51restaurant,
  • 08:52you have a script that
  • 08:54you follow, and when you
  • 08:55deviate from that script, it
  • 08:56freaks people out. And if,
  • 08:57for example, you were to
  • 08:58go to your friend's house
  • 08:59for dinner and you try
  • 08:59to follow the restaurant script,
  • 09:00it would be incredibly rude.
  • 09:02So, you know, in the
  • 09:03early nineties,
  • 09:04psychologists who study clinical reasoning
  • 09:06realized we're doing the same
  • 09:07thing with diseases.
  • 09:09So the idea here is
  • 09:10that when we
  • 09:12activate information about a disease,
  • 09:14what we're really doing is
  • 09:15telling a story to ourselves.
  • 09:16And that story has, obviously,
  • 09:18the presentation.
  • 09:19That story has the pathological
  • 09:21diagnosis, what that might look
  • 09:23like. It has other, you
  • 09:25know,
  • 09:26the treatments, all of
  • 09:27that is organized together. And
  • 09:29then more importantly, scripts do
  • 09:30not exist
  • 09:32in, isolation. It's not like
  • 09:34a library. They exist in
  • 09:35these parallel systems of networks
  • 09:38called schema, where almost by
  • 09:40definition, if you activate one
  • 09:42script, you're excluding another. And
  • 09:44in medical education, we spend
  • 09:45a lot of time, you
  • 09:46know, talking to our students
  • 09:48about semantic qualifiers.
  • 09:50You know, is this acute,
  • 09:51subacute, chronic,
  • 09:53polyarticular, monoarticular? And the reason,
  • 09:55the psychological reason,
  • 09:58this is right out of
  • 09:58Bordage, is that fundamentally
  • 10:02those are the ways that
  • 10:03we include or exclude different
  • 10:05diagnoses. This goes back to
  • 10:06a really interesting study by
  • 10:07Arthur Elstein in the late
  • 10:0870s, where he took a bunch of medical students at Harvard,
  • 10:11took a bunch of attendings,
  • 10:12and then had them think
  • 10:13out loud
  • 10:15about what was going on.
  • 10:16And as he expected, as
  • 10:18the dominant theory was, everyone
  • 10:20did the Sherlock Holmes thing.
  • 10:22Everyone was like, okay, this
  • 10:23is my theory. These are
  • 10:24the next tests, the hypothetico-deductive process. But what else
  • 10:28he noted is, okay,
  • 10:30well, the attendings only asked
  • 10:32five questions and they got
  • 10:34the diagnosis right. The medical
  • 10:35students asked thirty questions and
  • 10:37they were far less accurate.
  • 10:39So there's something else going
  • 10:41on than simply following a
  • 10:42hypothetico-deductive process. So what
  • 10:45does this look like in
  • 10:46practice?
  • 10:47So I'm in clinic. A
  • 10:48patient comes up to me.
  • 10:49She says what did she
  • 10:50say? I can't breathe. It
  • 10:51started yesterday.
  • 10:53Everyone has had this experience.
  • 10:54I mean, it's a little
  • 10:55bit different with pathology, but
  • 10:56you look at something and
  • 10:57then automatically your mind starts
  • 10:59to sort that information. What
  • 11:01is happening? You are activating
  • 11:02one of these scripts. In
  • 11:04medicine, we talk about a
  • 11:05problem representation, the instantiation of
  • 11:07a problem representation,
  • 11:09which is in this case,
  • 11:10acute dyspnea.
  • 11:11This is a way of
  • 11:12teaching, but it's activating that
  • 11:13part of my mind. This
  • 11:14idea that if somebody tells
  • 11:15me all of a sudden
  • 11:16I can't breathe versus six
  • 11:18months slowly progressive shortness of
  • 11:19breath fundamentally activates different ways
  • 11:22that I think about information.
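To make the script idea concrete, here is a minimal sketch, with invented scripts and qualifiers (none of this is from the talk), of how semantic qualifiers might activate or exclude illness scripts:

```python
# Toy illness-script activation: each script lists the semantic qualifiers
# it is compatible with; a problem representation activates only the
# scripts whose qualifiers all match. All content is invented for illustration.

SCRIPTS = {
    "pulmonary embolism":  {"onset": "acute",   "symptom": "dyspnea"},
    "COPD exacerbation":   {"onset": "acute",   "symptom": "dyspnea"},
    "pulmonary fibrosis":  {"onset": "chronic", "symptom": "dyspnea"},
}

def activate(problem_representation):
    """Return the scripts whose qualifiers all match the representation."""
    return [
        name for name, qualifiers in SCRIPTS.items()
        if all(problem_representation.get(k) == v for k, v in qualifiers.items())
    ]

# "I can't breathe. It started yesterday." -> the acute-dyspnea representation
print(activate({"onset": "acute", "symptom": "dyspnea"}))
# ['pulmonary embolism', 'COPD exacerbation'] -- the chronic script is excluded
```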
  • 11:23And then
  • 11:24and everyone has had this
  • 11:25experience. Without thinking, you know
  • 11:28what questions to ask next.
  • 11:30Why? Because they're going into
  • 11:32your script for acute dyspnea.
  • 11:33You are going through that
  • 11:34schema, and you're asking follow-up
  • 11:36questions. So I hypothesis test
  • 11:38COPD.
  • 11:39Hypothesis test pneumonia,
  • 11:41and based on that, further
  • 11:42refine my problem representation
  • 11:45until I get something that
  • 11:47is reasonably close to the
  • 11:48diagnosis or I find something
  • 11:51that doesn't make sense and
  • 11:52I cross over to other
  • 11:54types of metacognitive processes. Now,
  • 11:56I said that I'm going
  • 11:57to go over some of
  • 11:58the, like,
  • 11:59different understandings of clinical reasoning
  • 12:01because I will say that
  • 12:02until,
  • 12:03let's say, about twenty ten,
  • 12:05we had basically agreed in
  • 12:07the reasoning field that script
  • 12:09theory was it.
  • 12:10I think that
  • 12:12it doesn't take much to
  • 12:13think about the problems with
  • 12:15script theory and that it's
  • 12:16not a necessary or sufficient
  • 12:18way to think about how
  • 12:19we make decisions. So my
  • 12:20classic example is right now
  • 12:21I've had about a half
  • 12:22gallon of coffee. So let's
  • 12:23say I'm admitting patients, I'm
  • 12:24well rested, well caffeinated, ten
  • 12:26AM.
  • 12:27My mind is gonna work
  • 12:28differently than when I'm working
  • 12:29at two AM. My pager
  • 12:32goes off every five minutes,
  • 12:33and my Epic chat just keeps going red over and over and over again. Our minds, I mean, they just don't work like that.
  • 12:40There are a lot of
  • 12:41ecological
  • 12:42factors that affect our reasoning
  • 12:44process. And ecological psychology is one of the frames now that we think in,
  • 12:49which is that reasoning is
  • 12:50not something that just happens
  • 12:51up here. Reasoning is something
  • 12:53that happens in our bodies,
  • 12:55happens in our environment, and
  • 12:56these environmental factors play a
  • 12:57large part. Again, I don't
  • 12:58think any of this is
  • 12:59controversial.
  • 13:01I will put up this
  • 13:02idea of situated cognition.
  • 13:04So to say that
  • 13:05something is situated is to
  • 13:06say that it's contingent. These
  • 13:07are all psychological terms. But
  • 13:09the idea here is that
  • 13:10there's no such thing as
  • 13:11neutral decisions. They're all shaped
  • 13:13by our prejudices, by our
  • 13:14life experiences.
  • 13:16And because of that, decision making, per situated cognition, is individualized.
  • 13:22I don't find this very
  • 13:23helpful, like, from somebody who
  • 13:24wants to, you know, make
  • 13:25people better. I don't find
  • 13:26this a very helpful psychological
  • 13:27theory, but it is one
  • 13:28that is out there that
  • 13:29you should know about. And
  • 13:30then one of the big
  • 13:32things that I think is
  • 13:32especially relevant in pathology is how I make diagnoses in the real world. Also, you can tell these are AI generated images because a lot of them have, like, weird demon
  • 13:41faces as the model gets
  • 13:43confused and tries to average
  • 13:44Simpsons characters and other things.
  • 13:46But, you know, I might
  • 13:47make a diagnosis, and I'll
  • 13:48pat myself on the back
  • 13:49that maybe I've diagnosed lupus
  • 13:51nephritis and a new case
  • 13:52of lupus. But you know
  • 13:53what? What I'm doing is
  • 13:54I'm looking at a creatinine
  • 13:55trend that was identified by
  • 13:57the patient's PCP and a nephrologist as an outpatient, and I
  • 14:00might be looking at an
  • 14:00ANA or a double stranded
  • 14:02DNA that was sent much
  • 14:03later, and all of this
  • 14:04is mediated through the electronic
  • 14:05health record. So the reality
  • 14:07of diagnosis, it's not collaborative in the sense that we're in the same room working together,
  • 14:13but most
  • 14:14diagnosis, I don't know most,
  • 14:16I don't know the percentage,
  • 14:17but a lot of diagnosis
  • 14:18in the year twenty twenty
  • 14:19four is mediated through the
  • 14:20electronic health record happening with
  • 14:22different people at different places
  • 14:24in time.
  • 14:25Okay. So why do we
  • 14:27care so much about the
  • 14:28cognitive processes of reasoning? Like,
  • 14:31why does anybody give me money to study
  • 14:33this, which I wonder myself
  • 14:35also sometimes?
  • 14:36And for the last, let's
  • 14:37say, twenty five years, the
  • 14:39focus has been on this.
  • 14:40Ever since To Err Is Human came out, there has
  • 14:42been a huge focus on
  • 14:44medical errors and the burden
  • 14:45of medical errors. This is
  • 14:46from David Newman-Toker's most
  • 14:47recent paper. I actually think
  • 14:49that this is what he
  • 14:50pulled out. I think the
  • 14:51more damning thing is that
  • 14:53he says fifteen diseases account for half of serious harms.
  • 14:55Four diseases accounted for thirty
  • 14:57nine percent of harms in
  • 14:58this study, and those diseases
  • 14:59were such exotic things as
  • 15:00stroke and heart attacks. So
  • 15:03doctors make misdiagnoses that cause
  • 15:04harm all the time. We're
  • 15:06not missing lupus nephritis. Well,
  • 15:08we are, but we're also
  • 15:08missing, like, heart attack and
  • 15:10stroke.
  • 15:11This is from Andy Auerbach's
  • 15:13group. Fantastic
  • 15:14study that looked at
  • 15:17very specific cognitive errors because,
  • 15:19you know, one of the
  • 15:19challenges has always been, yeah, there are diagnostic errors. Of course there are. But is that
  • 15:23because of our human brains,
  • 15:25or is that because of
  • 15:26other systems factors? I mean,
  • 15:27we all know that
  • 15:29getting outside records can be
  • 15:31difficult. There's all sorts of
  • 15:32interruptions. So maybe it's just
  • 15:34our fragmented health system that's
  • 15:35doing this. And David Newman-Toker, sorry, Andy Auerbach, what he did is he looked at fourteen different
  • 15:41hospitals across the US, patients
  • 15:43who were admitted to the
  • 15:44hospital, appropriately triaged, meaning that
  • 15:45they didn't decompensate within the
  • 15:47first forty eight hours, and
  • 15:48then either went to the
  • 15:49ICU or died. And in
  • 15:50those patients, a quarter of
  • 15:52them had a diagnostic error,
  • 15:54and a lot of these
  • 15:56were severe. One fifth of
  • 15:57those errors caused severe harm
  • 15:58or death. And when he looked at the reasons, the most
  • 16:02common cause was this, errors
  • 16:04in human cognition by far,
  • 16:05almost half, with the second
  • 16:07biggest being in test interpretation
  • 16:08and ordering, which is also
  • 16:09a human cognitive error. So
  • 16:11systems errors were actually much
  • 16:12less common. Mind you, this
  • 16:13is a very specific situation,
  • 16:15so patients in the hospital
  • 16:16who are decompensating, so I
  • 16:17don't know that that's generalizable.
  • 16:19But in this population, cognitive
  • 16:21errors are common.
  • 16:23So
  • 16:24in my field,
  • 16:25which I guess is medical
  • 16:26education, reasoning has usually been
  • 16:27within the med ed field,
  • 16:29but we've looked historically at
  • 16:31ways that we can make
  • 16:32people better.
  • 16:33Three big categories. So first,
  • 16:36everyone here who is my
  • 16:38age or less,
  • 16:40I'm forty,
  • 16:41has probably
  • 16:42received some education about cognitive
  • 16:44biases. Like, actually, raise your
  • 16:46hand if you've ever been
  • 16:46taught about cognitive biases.
  • 16:49Yeah. I would expect to
  • 16:50see half or more. So
  • 16:52cognitive biases, of course, this
  • 16:53is like your anchoring bias.
  • 16:54This is recency bias.
  • 16:56This has been a major
  • 16:57trend in medical education over
  • 16:58the last fifteen, twenty years.
  • 17:00The problem is that when
  • 17:01we study these things experimentally,
  • 17:03teaching people about cognitive biases
  • 17:05does not make them less likely to have cognitive
  • 17:08biases. I can teach you
  • 17:09all about anchoring bias. You
  • 17:11are still going to anchor. You may
  • 17:14have the metacognition that you're
  • 17:15doing that, but you can't
  • 17:17stop. And why? Like, sometimes
  • 17:19I'm pretty amazed. Like, look
  • 17:20where we are. Look at
  • 17:21this technology. I am like
  • 17:23a relatively hairless ape that
  • 17:26evolved over millions of years on the plains of
  • 17:28Africa. Our brains are not
  • 17:29evolved to work in this
  • 17:31highly technical world. So it
  • 17:33is not surprising that
  • 17:34there are cognitive biases. These
  • 17:35are the shortcuts that we
  • 17:37evolved with.
  • 17:38It's just part of
  • 17:39being human, so that doesn't
  • 17:41work. Number two is education
  • 17:43about debiasing strategies. This is
  • 17:45very popular and kind of
  • 17:47optimistic because it works. So
  • 17:48if you think about deliberate practice, the literature on deliberate practice,
  • 17:52you know, this would be
  • 17:53the Malcolm Gladwell ten thousand
  • 17:55hours,
  • 17:57even though that has been
  • 17:59somewhat debunked.
  • 18:00But, like, those psychological studies,
  • 18:03if you were to spend
  • 18:04five hours every single week
  • 18:06going over
  • 18:07every single pathologic diagnosis you
  • 18:09made, reflecting on what you
  • 18:10could have done better, seeing
  • 18:11how the patients did, you
  • 18:13will get better at the
  • 18:14diagnostic process. This has been
  • 18:16studied in internists.
  • 18:18It works.
  • 18:19No one has five hours.
  • 18:21And if you were to
  • 18:21spend five hours doing that,
  • 18:23your bosses would get mad
  • 18:24at you for not closing
  • 18:25out enough cases or for
  • 18:26me, like, my length
  • 18:27of stay would go up.
  • 18:28Like, our system does
  • 18:30not allow for these debiasing
  • 18:32strategies as effective as they
  • 18:33are. I mean, to this
  • 18:34day, I keep a follow-up
  • 18:35list. I have ten patients
  • 18:36on it because that's all
  • 18:37I can put because I
  • 18:38don't have time. And whenever
  • 18:39I put a new patient
  • 18:40on, I have to take
  • 18:40an old patient off. I
  • 18:41spend maybe thirty minutes a
  • 18:43week. I don't even know
  • 18:44if it makes me that
  • 18:44much better, but at least
  • 18:46I try. And even then,
  • 18:47it's really hard to do
  • 18:49that. So that brings us
  • 18:50to number three,
  • 18:52which, doctor Homer mentioned, AI.
  • 18:54We've been talking about AI
  • 18:55for a long time.
  • 18:57So artificial intelligence has been
  • 18:59the third historical and actually
  • 19:01the oldest way to make
  • 19:02humans better. So this is
  • 19:04where,
  • 19:06doctor Homer was talking about
  • 19:07nineteen seventy nine. I assume
  • 19:09they were talking about INTERNIST-1, probably,
  • 19:11but we've been talking about
  • 19:12this for over a century.
  • 19:14So this is the oldest
  • 19:15quote I've ever found about
  • 19:17artificial intelligence in medicine from
  • 19:18Bernard Shaw. He's reflecting on how, like, no one does math by hand anymore, in old fashioned spreadsheets, on paper, where we had to do all this math.
  • 19:26By the early twentieth century,
  • 19:27there were counting machines that
  • 19:28did all that. And he
  • 19:29says, in the clinics and
  • 19:30hospitals of the near future,
  • 19:31we may quite reasonably expect
  • 19:33that the doctors will delegate
  • 19:34all the preliminary work of
  • 19:35diagnosis to machine operators.
  • 19:38And then he says, the
  • 19:38observation of the symptoms is
  • 19:40extremely fallible, depending not only
  • 19:42on the personal condition of
  • 19:43the doctor who has possibly
  • 19:44been dragged to the case
  • 19:45by his night bell after
  • 19:46an exhausting day. I love
  • 19:47that that also hasn't changed
  • 19:49in over a century. But
  • 19:50upon the replies of the
  • 19:51patient to questions which are
  • 19:52not always properly understood and
  • 19:54for lack of the necessary
  • 19:55verbal skill could not be
  • 19:56properly answered if they were.
  • 19:58From such sources of error,
  • 19:59machinery is free. So when
  • 20:01I talk about artificial intelligence,
  • 20:02we are not talking about
  • 20:03anything new. These ideas have
  • 20:05been around for a long
  • 20:06time. And depending on how
  • 20:08you define a diagnostic AI,
  • 20:11the first artificial intelligence in
  • 20:12medicine was made about six
  • 20:14months after the first electronic
  • 20:15computer was made. So we've
  • 20:17been working on this for
  • 20:18a long time.
  • 20:19Historically,
  • 20:20AI clinical decision support
  • 20:22has had really big impacts,
  • 20:24though, in limited domains. One
  • 20:26of my favorite examples is
  • 20:27AAPHelp. This is sometimes called the Leeds abdominal pain system; it's for patients presenting
  • 20:31with an acute abdomen.
  • 20:33A multicenter RCT showed a
  • 20:35mortality
  • 20:36decrease in acute abdomens of
  • 20:38twenty two percent.
  • 20:39That is huge. It's a
  • 20:41multicenter RCT. It was so
  • 20:43important that the US Navy
  • 20:45considered it an essential part
  • 20:46of our nuclear shield because
  • 20:47they put it on all
  • 20:48the, the submarines with ICBMs. Because if a seaman had an acute abdomen, you needed to know whether to surface that submarine and evacuate that seaman, in which case part of your nuclear umbrella went away. And then, of
  • 20:57course, there's INTERNIST-1. I don't know if anyone here knew Jack Myers, but INTERNIST-1 is one of the
  • 21:02coolest AI systems of all
  • 21:03time. It was modeled on
  • 21:04the brain of Jack Myers,
  • 21:06sometimes called, by the people who knew him, Black Jack.
  • 21:09He is the reason we
  • 21:10no longer have oral boards
  • 21:11in medicine. He had an eidetic memory and was chair of medicine at Pitt. But an AI system
  • 21:15that was based on his
  • 21:15brain in the year nineteen
  • 21:16eighty two could solve the
  • 21:18New England Journal of Medicine
  • 21:19clinical pathological conference better than
  • 21:21any human could. So not
  • 21:23new technologies.
  • 21:24Now
  • 21:25one of the sad things
  • 21:26is that the technology really
  • 21:27stagnated. It worked really well.
  • 21:30A lot of these computational
  • 21:31strategies worked really well in
  • 21:32narrow
  • 21:33categories,
  • 21:34but from roughly nineteen ninety
  • 21:36to the year twenty twelve,
  • 21:38there wasn't really any change
  • 21:40in the performance. There's actually
  • 21:41a review from twenty twelve
  • 21:42that shows that differential generators
  • 21:44in twenty twelve worked exactly
  • 21:45as well as those that
  • 21:46had come twenty years before.
  • 21:47So real stagnation of the
  • 21:49computational techniques.
  • 21:50Now
  • 21:52language models are the most
  • 21:54exciting technology in the clinical
  • 21:56reasoning space to come out
  • 21:57in
  • 21:59forty years, basically since the
  • 22:00early nineteen eighties. I'm
  • 22:02going to briefly explain
  • 22:04what they are, how they
  • 22:05work, and why they might
  • 22:06be exciting before actually demonstrating,
  • 22:09like, how it works on
  • 22:10a real case. So a
  • 22:12language model. I should say, who
  • 22:15here has not used a
  • 22:16language model before?
  • 22:19Oh. A couple people. So
  • 22:20most everyone, I imagine, has
  • 22:21at least put something into
  • 22:23ChatGPT before?
  • 22:24Yeah. I think they're pretty ubiquitous. I believe
  • 22:27ChatGPT was the most downloaded
  • 22:29app in history. So, presumably,
  • 22:30most people have. So how does a language model work? It's a type of neural
  • 22:35network. It's actually a transformer.
  • 22:37But, effectively,
  • 22:38it's autopredict on steroids. So
  • 22:41let's say I'm sure everyone
  • 22:43has done this. You go
  • 22:43to Google. You type in
  • 22:44pathologists are, and it predicts
  • 22:46a bunch of different words.
  • 22:47If I type in internal
  • 22:48medicine physicians are, it'll probably
  • 22:49say, like, nerdy or awkward
  • 22:51or maybe it'll say smart,
  • 22:53but it has a lot
  • 22:53of predictions on what the
  • 22:54next word is based on
  • 22:56the tens of thousands of
  • 22:57searches that have come before.
  • 22:58A language model takes that
  • 22:59fundamental technology
  • 23:01and puts it on steroids.
  • 23:03So it takes, not all of human text, but virtually all
  • 23:07of human text. We don't
  • 23:08know what's in the training
  • 23:08corpus of the latest foundation
  • 23:10models, but we do know
  • 23:11for GPT three class models,
  • 23:12which is all of the
  • 23:13Internet,
  • 23:14a lot of pirated books,
  • 23:16a lot of human textual
  • 23:17material. We also know from
  • 23:18news reports that these companies
  • 23:19are doing things like scraping
  • 23:21YouTube videos, scraping podcasts. They're
  • 23:23hungry for textual data. So
  • 23:26this is the script, by
  • 23:26the way. Huge nerd here,
  • 23:27internal medicine physician. Shocker. This
  • 23:29is a script for Star Trek II: The Wrath of Khan, which is because I'm
  • 23:32a Trekkie, but it's also
  • 23:33because GPT two had a
  • 23:34very famous experiment where you
  • 23:36trained it only on the
  • 23:37scripts from Star Trek because,
  • 23:38again, the people who built
  • 23:39these models were nerds, and
  • 23:41someone actually built a dataset
  • 23:42only on scripts of Star
  • 23:43Trek, which I love.
  • 23:44So
  • 23:45it breaks every single word, every single piece of
  • 23:50text into units called tokens.
  • 23:52And a token is a
  • 23:53piece of a word. It
  • 23:55is a basic semantic unit
  • 23:57of a language model.
  • 23:58And just like in the
  • 23:59example where you type in
  • 24:00pathologists are, and it predicts
  • 24:02the next word, it predicts
  • 24:03the next token based on
  • 24:04this huge training set of
  • 24:06information.
  • 24:07What makes it a large
  • 24:08language model is that it's
  • 24:10based on a technology called
  • 24:11a transformer, which doesn't just
  • 24:13predict the next token,
  • 24:14but it predicts the
  • 24:15next token in a vector
  • 24:16string of tokens, which means,
  • 24:18semantically, it's predicting the next
  • 24:20word in the context of
  • 24:21a sentence, in the context
  • 24:22of a paragraph, in the
  • 24:23context of an entire book.
  • 24:25And because these are so
  • 24:26large and computationally intensive, it's
  • 24:29doing these calculations. You know, you go to Google, you type it once, that's one calculation. It's doing that across four hundred billion parameters.
  • 24:36So massive amounts of calculations
  • 24:38to figure out what the
  • 24:40next token in a string
  • 24:41should be.
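As a very scaled-down sketch of that "autopredict on steroids" idea, here is a toy next-word predictor built from bigram counts over a tiny invented corpus; a real LLM replaces the count table with a transformer conditioned on the whole preceding context, but the output, a probability distribution over the next token, is the same kind of object:

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for "virtually all of human text."
corpus = ("internal medicine physicians are nerdy . "
          "pathologists are nerdy . pathologists are smart .")

# Trivial "tokenizer": split on whitespace. Real models use subword tokens.
tokens = corpus.split()

# Count how often each token follows each token (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def next_token_distribution(prev):
    """Probability distribution over the next token, given the previous one."""
    counts = following[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

print(next_token_distribution("are"))
# {'nerdy': 0.666..., 'smart': 0.333...}
```

The transformer's contribution is conditioning that prediction on the entire token sequence rather than just the previous word, which is what lets it track context across a sentence, a paragraph, or a whole document.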
  • 24:42And now
  • 24:44I, as somebody well, I'm
  • 24:45losing my voice. I guess
  • 24:46I'm not surprised. I've been
  • 24:47talking nonstop for the last
  • 24:48twenty four hours. As someone
  • 24:49who studies human reasoning,
  • 24:51humans are both, like, wonderful
  • 24:52creatures, and we're terrible sometimes. Right? So
  • 24:57we wrote the declaration of
  • 24:58the rights of man. We
  • 24:59wrote the Torah. We
  • 25:00wrote the Bhagavad Gita. We
  • 25:02also wrote Mein Kampf and
  • 25:03the website 4chan, and
  • 25:04all of those are encoded
  • 25:06in language models. So the
  • 25:08final step that makes these models: there is a pretraining step that I won't go into,
  • 25:11but the final step that
  • 25:12makes them so eerily human
  • 25:14is
  • 25:14reinforcement learning through human feedback.
  • 25:16Sometimes it's called fine tuning.
  • 25:18There's many ways to fine
  • 25:19tune, though. But the idea
  • 25:20here is that a human
  • 25:21being actually sits down and
  • 25:22talks to a language model and
  • 25:24says, oh, it's a Skinner
  • 25:25box. Right? You did good
  • 25:26language model, you get a
  • 25:27cookie, or you did bad,
  • 25:28you get an electric shock.
  • 25:29It's a one or a
  • 25:30zero, but same idea.
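As a caricature of that Skinner-box loop, here is a minimal sketch; real RLHF trains a separate reward model and updates the network's weights with policy-gradient methods, but the cookie-or-shock feedback signal works the same way (the completions and the rater here are invented):

```python
import math
import random

# Two candidate completions, each with a learnable preference score.
scores = {"Let's delve into that.": 0.0, "asdf qwerty": 0.0}

def sample():
    """Sample a completion with probability proportional to exp(score)."""
    weights = [math.exp(s) for s in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

def human_feedback(completion):
    """Stand-in for a human rater: 1 is a cookie, 0 is the electric shock."""
    return 1 if "delve" in completion else 0

for _ in range(500):  # many tiny training cycles
    completion = sample()
    reward = human_feedback(completion)
    scores[completion] += 0.1 * (reward - 0.5)  # reinforce or punish

print(scores)  # the "delve" completion ends up strongly preferred
```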
  • 25:31And then over time, through
  • 25:33all these training cycles, it
  • 25:34makes them remarkably human. So
  • 25:36a fun fact about the
  • 25:37word delve, everybody knows that
  • 25:38ChatGPT loves to say, let's
  • 25:39delve into that. Right? And
  • 25:41there was a great study
  • 25:41that was just published, well,
  • 25:43about a year ago that
  • 25:44showed that in the scientific
  • 25:45literature, the word delve was
  • 25:46almost never used, and now
  • 25:47it's used incredibly often because everyone's using ChatGPT to help write
  • 25:50their papers. Well, why is
  • 25:51delve in there? It turns
  • 25:53out that OpenAI
  • 25:54used contract workers in Nigeria
  • 25:56and Kenya to do most
  • 25:57of their RLHF.
  • 25:58And delve sounds weird in American English, but in Kenyan and Nigerian English, delve is commonly used. It doesn't sound weird to them. So now delve is a big
  • 26:07part of American scientific literature
  • 26:10because of the RLHF of
  • 26:12a language model by contract
  • 26:13workers in Kenya and Nigeria.
  • 26:15So just one of those
  • 26:16fun things. And, of course,
  • 26:18what you get is something
  • 26:19that is remarkably human. You
  • 26:20know, I tell ChatGPT
  • 26:22I'm doing community theater. I
  • 26:23wanna do the wrath of
  • 26:24Khan. And it says, farewell,
  • 26:25noble admiral. Hold not your
  • 26:26breath. The enterprise cannot save.
  • 26:28She's marked for death. From
  • 26:29the sky, soon shall be
  • 26:30torn asunder, a fiery end,
  • 26:31a final echoing thunder. In Shakespeare, right? So he still goes, KHAAAN!
  • 26:36So I did just do
  • 26:37this so I could do
  • 26:37this meme. You're welcome. But
  • 26:39to also point out that
  • 26:41the technology that is underneath
  • 26:43this, a transformer,
  • 26:44can be used to produce any kind of human creative output. This is just DALL-E 3. It's Captain Kirk at the Globe Theatre. DALL-E 3 is what's called
  • 26:52a diffusion model. There are
  • 26:54now video models. If anyone
  • 26:55has seen Sora, that's the
  • 26:56OpenAI video model. There's this
  • 26:58crazy thing. Like, there are
  • 26:59video games that are diffusion
  • 27:00video games, meaning that you,
  • 27:02like, walk through a video
  • 27:03game, and it renders
  • 27:04every single scene from an
  • 27:06AI, like, hallucination. And then,
  • 27:08of course, most concerningly, you
  • 27:09can clone people's voices really
  • 27:11effectively now, about ninety seconds
  • 27:13of audio, and you can
  • 27:14sound like anybody. And as
  • 27:15someone who has, like, thirty
  • 27:17hours of podcasting out, someone
  • 27:18could easily clone me. And
  • 27:19so if I call you
  • 27:20up, don't trust that it's
  • 27:21me, especially if I'm trying
  • 27:22to get your Social Security
  • 27:23number.
  • 27:24Okay. Why does this matter
  • 27:25for diagnosis?
  • 27:27Because that's cool, right? I've told a couple people this: I got into all of this research because I
  • 27:33was working on my second
  • 27:34book, which was about clinical
  • 27:35reasoning. I talked to a bunch of data scientists, and I had used GPT three before, like, before ChatGPT was released. And
  • 27:43I thought it was stupid,
  • 27:44and it wasn't gonna accomplish
  • 27:45anything, and I was not
  • 27:46impressed at all. So
  • 27:47with that caveat, maybe you
  • 27:49shouldn't listen to anything that
  • 27:50I say. But, like, why
  • 27:51do language models appear to
  • 27:53work in diagnosis? So this
  • 27:54is a real case of
  • 27:55mine. This is a patient
  • 27:56in whom I made a
  • 27:57diagnostic error.
  • 27:58The patient died.
  • 28:00And what I will say
  • 28:01is that the patient was
  • 28:03like, I am friends with
  • 28:04his wife. I have permission
  • 28:05from his family. He was
  • 28:06also a huge trekkie. So
  • 28:07this patient had night sweats,
  • 28:10monocytosis,
  • 28:11a daily fever, ground glass
  • 28:12opacities on the X-ray. He
  • 28:13had been treated for bladder
  • 28:14cancer, like, four or five
  • 28:15years before, BCG.
  • 28:18He did not improve with
  • 28:19antibiotics, which is when I
  • 28:20met him in the hospital.
  • 28:21He still had the fevers
  • 28:22despite high dose antibiotics.
  • 28:23Liver enzymes were crazy. And
  • 28:24then he had spinal hardware
  • 28:26in his back from previous
  • 28:27surgeries. And, you know, about two weeks before GPT-4 was released,
  • 28:32I got back from the
  • 28:33state lab what he had
  • 28:34actually died from. So this
  • 28:35poor man, we thought he
  • 28:35had culture negative endocarditis.
  • 28:37He died from a freak
  • 28:38case of M. bovis bacteremia, presumably reactivated from his BCG treatment. Quite rare,
  • 28:45but misdiagnosis.
  • 28:46It wasn't just me. Like,
  • 28:47I was the hospitalist
  • 28:48with the residents. Like, there
  • 28:49were a lot of doctors
  • 28:50involved. But, you know, one
  • 28:51of the first things that
  • 28:51I did when GPT four
  • 28:53came around is I
  • 28:54asked for a second opinion
  • 28:56based on my problem representation
  • 28:57at the time. And what
  • 28:59happens is that it says,
  • 29:01you know, the the number
  • 29:02one diagnosis was what the
  • 29:03patient died from, and the
  • 29:04number two diagnosis was what
  • 29:06I was wrong about.
  • 29:08So, you know,
  • 29:09this is me in twenty
  • 29:11twelve, and my first thought
  • 29:12is what if I had
  • 29:13had this technology six months
  • 29:14ago?
  • 29:15Would this have changed anything?
  • 29:17And we actually don't know
  • 29:18the answer because second opinions,
  • 29:20there are not great
  • 29:21studies of second opinions, at
  • 29:23least in internal medicine. We
  • 29:24do know
  • 29:25that there are discrepancies, and they can be quite large; these are
  • 29:28pathological diagnoses, so these are
  • 29:30actually pathology studies.
  • 29:32There's only one good prospective
  • 29:33study on this from the
  • 29:34Netherlands, which did find, like,
  • 29:37frequent switching of diagnoses
  • 29:39and improved symptoms. The patients
  • 29:41actually did better when they
  • 29:42got a second opinion.
  • 29:43But, you know, the data
  • 29:45is pretty early. So how
  • 29:47do language models presumably do
  • 29:49a good job at diagnosis?
  • 29:51Well,
  • 29:52we've done some cool, like, ablation studies. My job is really cool. I basically get to do psychology
  • 29:57on both humans and machines,
  • 29:58but we can do ablation
  • 29:59studies where we knock out
  • 30:00parts of the language model
  • 30:01to figure out what's going
  • 30:02on. And what appears to happen is that
  • 30:05the reason that language models
  • 30:06can make diagnoses is because
  • 30:07of their similarity
  • 30:09to how human brains make
  • 30:10diagnoses. Right? So if you
  • 30:11think about token prediction, you
  • 30:13think about the log probs
  • 30:14of different words,
  • 30:16that's what a script is.
  • 30:17So LLMs are basically
  • 30:20system one on steroids, and
  • 30:22they encode far more knowledge
  • 30:24than we do.
  • 30:25They are not perfect, but
  • 30:27it does appear that this
  • 30:28is what gives them their
  • 30:29remarkable abilities in diagnosis.
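You can see that system-one analogy directly, because most model APIs expose the log probabilities. A minimal sketch using the OpenAI Python client (the model name and prompt are placeholders, and logprobs support varies by model):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask for a single next token after a diagnostic prompt, along with the
# log probabilities of the top competing tokens.
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use any chat model that supports logprobs
    messages=[{
        "role": "user",
        "content": "One word: the most likely diagnosis for acute dyspnea, "
                   "pleuritic chest pain, and tachycardia is",
    }],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# Each alternative token carries its log probability -- loosely, the
# "activation strength" of the competing scripts.
for alt in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{alt.token!r}: {alt.logprob:.2f}")
```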
  • 30:31So with that, wait, who is my volunteer
  • 30:34internist for this?
  • 30:36Yeah.
  • 30:38That's it. You work on
  • 30:39Yelp?
  • 30:40So I am going to
  • 30:42show an example of
  • 30:44how this technology works.
  • 30:46Oh, jeez.
  • 30:47So no. No. Five years.
  • 30:48It's I'm sorry. No. This
  • 30:51is, so I'll just explain
  • 30:53what we're doing. So right
  • 30:54now, if you go to
  • 30:54the BI and you get
  • 30:56admitted to the emergency room
  • 30:57and you meet certain inclusion
  • 30:58or exclusion criteria, you will
  • 30:59be pulled into our data
  • 31:00pipeline
  • 31:01to study the effect of second opinions at different, what we call, diagnostic touch points.
  • 31:09Because one of the ideas
  • 31:10is that when you evaluate these things, it depends on the information
  • 31:14density. Right? Real diagnosis, you
  • 31:16don't get a case vignette.
  • 31:17You're often operating in
  • 31:19poor information settings. So what
  • 31:21we are doing here is
  • 31:22I am going to, walk
  • 31:24you through this is all
  • 31:25real material. I've stripped it
  • 31:26of PHI. This is a
  • 31:27real patient.
  • 31:28Just so you know, I intentionally just chose a random patient. This
  • 31:32is not a zebra. Okay.
  • 31:34This is not like a
  • 31:34CPC. So what I want
  • 31:36you to do is this
  • 31:37is the information that was
  • 31:39available to the poor emergency
  • 31:40room resident, and I want
  • 31:41you to tell me what
  • 31:42your thinking is, what you
  • 31:44would wanna do, and then
  • 31:45I'm going to ask the
  • 31:46model, and you can tell
  • 31:47me how that changes your
  • 31:48thinking.
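The demo that follows boils down to a loop like this sketch (the aliquots, the prompt, and the model name are stand-ins, not the lab's actual pipeline code): at each diagnostic touch point, only the new chart material is appended, and the model regenerates its differential with the accumulated context:

```python
from openai import OpenAI

client = OpenAI()

# Chart "aliquots" in the order a clinician would actually see them.
# Placeholder text; the real pipeline pulls these from the EHR, stripped of PHI.
touch_points = [
    ("ED triage note", "Chest pain, tachycardia. New PE and DVT dx 5 days ago ..."),
    ("ED resident note", "History of lupus, worsening dyspnea starting today ..."),
    ("Imaging", "CXR: bibasilar opacities. CTA: RLL segmental PE ..."),
    ("Labs", "Lactate 1.7, Na 131, Hgb 9.1, INR 1.7, troponin negative ..."),
]

history = [{"role": "system",
            "content": "You are a diagnostic second-opinion tool. After each "
                       "new piece of chart data, give an updated ranked "
                       "differential diagnosis and suggested next steps."}]

for name, text in touch_points:
    history.append({"role": "user", "content": f"{name}:\n{text}"})
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(f"--- differential after {name} ---\n{answer}\n")
```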
  • 31:49Okay. So I'll read
  • 31:50it out loud if you
  • 31:51want even though I'm losing
  • 31:52my voice. So this is
  • 31:53a a young woman who
  • 31:54walks into our emergency room.
  • 31:56This is the ED triage
  • 31:57note. So this is someone
  • 31:58this history is taken not
  • 31:59in the emergency room, but
  • 32:00in the waiting room, and
  • 32:01a nurse has taken this.
  • 32:03So,
  • 32:04chief complaint, they have to
  • 32:05put in an ICD code.
  • 32:07So this is chest pain,
  • 32:08tachycardia.
  • 32:09The triage history, patient reports
  • 32:10a new PE diagnosis and
  • 32:11left lower extremity DVT at an outside hospital five days ago. Put on Eliquis ten BID since four days ago. Patient now arrives here with worsening chest pain, cough,
  • 32:19and tachycardia.
  • 32:20The nurse appropriately put this
  • 32:22to the highest severity level,
  • 32:24one, which means that the
  • 32:25doc she goes back immediately
  • 32:26and the doctor sees her.
  • 32:27And vitals, fever, one zero
  • 32:29one. Heart rate, one forty.
  • 32:30Respiratory rate, twenty six. Blood
  • 32:32pressure and o two sats
  • 32:33are fine.
  • 32:35K.
  • 32:38So you want me to
  • 32:39Yeah. Just say what you
  • 32:39would what you would do.
  • 32:42Okay. So, let's get a chest x-ray. Let's get an EKG.
  • 32:56I'm worried she is a
  • 32:58I'm worried she still has
  • 32:59a PE. Yeah. I'm worried
  • 33:00she has either,
  • 33:02I'm worried she has a
  • 33:03new infection now for some
  • 33:04reason, and I'm worried she
  • 33:05has, like, a,
  • 33:07like, a dissection possibly. Correct.
  • 33:10So your top worries are new or worsening PE, infection,
  • 33:15something like an aortic dissection.
  • 33:20Me and Macs don't get along.
  • 33:23Okay.
  • 33:25So one of the questions
  • 33:26that I always get when
  • 33:27people come to my lab
  • 33:28is they're like, can I
  • 33:28see the AI? Because they
  • 33:29think there's something really cool
  • 33:30when you see an AI.
  • 33:31The reality is that it's
  • 33:32an Excel spreadsheet.
  • 33:34It's a JSON database. So
  • 33:36it's not exciting. So what
  • 33:37I'm doing here, this is the model I showed you earlier. This is not the model we're using; in reality, I'm using a Llama 3 model. This is o1, which is the latest
  • 33:45and greatest model from OpenAI
  • 33:47that really freaks me out.
  • 33:48But, we'll see what it's
  • 33:50going to show. And to
  • 33:51be clear, I haven't run
  • 33:52this through o1 yet,
  • 33:53so I don't know what
  • 33:54it's going to show.
  • 33:55Okay.
  • 33:56So
  • 33:58wow. It's going fast.
  • 33:59It agrees with you that
  • 34:01number one on this differential
  • 34:02is a recurrent or worsening pulmonary embolism. Infection
  • 34:05is the number two. It's
  • 34:06picking up that fever. Could
  • 34:07this be a pneumonia? Number
  • 34:09three, it's now considering also
  • 34:10pericarditis.
  • 34:11Could pericarditis be going on?
  • 34:13Fair enough.
  • 34:15ACS, it's considering.
  • 34:17Again, infection,
  • 34:19and then subtherapeutic
  • 34:20anticoagulation,
  • 34:22pneumothorax.
  • 34:22These are all very unlikely
  • 34:24or COVID nineteen infection. And
  • 34:25it also wants very similar
  • 34:27things to you. It wants
  • 34:27an EKG, a chest X-ray.
  • 34:28It wants a repeat CTA.
  • 34:30This is its management plan. Does any of this
  • 34:33change your thinking?
  • 34:35No.
  • 34:37Is it helpful?
  • 34:39It's okay to say
  • 34:41no. I mean, it just
  • 34:42kind of confirms
  • 34:43what I was
  • 34:45thinking anyway, I guess. Yeah.
  • 34:46So it's it's confirmatory. It
  • 34:47makes you more confident in
  • 34:48what you were thinking.
  • 34:51Okay. On to the next
  • 34:52aliquot. So this is very
  • 34:54unique to our workflow. So
  • 34:55what happens next is the
  • 34:57poor ED resident who presumably
  • 34:58has like twenty other patients
  • 35:00has to come immediately to
  • 35:01see this patient because it's
  • 35:02an ESI of one.
  • 35:03The ED resident, she writes,
  • 35:05patient is a young female
  • 35:06presented to the ED for
  • 35:07chest pain and worsening shortness
  • 35:09of breath. Patient notes that
  • 35:10she has a history of
  • 35:11lupus, was diagnosed with a
  • 35:12PE five days ago after
  • 35:13having a CT for shortness
  • 35:14of breath and chest pain,
  • 35:15started on Eliquis. I think
  • 35:17all of this is the
  • 35:17same. Today, she had worsening,
  • 35:19shortness of breath, chest pain,
  • 35:20worsening palpitations, which prompted her
  • 35:22to present to the ED.
  • 35:23And then she was triaged; all of this, we already know. So the history of lupus, and that this just started today, are the additional things that the ED resident picked up.
  • 35:33Does any of that change
  • 35:34your thinking?
  • 35:35I'm worried
  • 35:36now more that she's got,
  • 35:38like, a constrictive pericarditis, but the PE is
  • 35:43still number one. Yeah.
  • 35:45So you're considering other things,
  • 35:46but it hasn't really changed
  • 35:48yet.
  • 35:50Okay. Let's see. Oops.
  • 35:56You would think I know
  • 35:57how to use a Mac.
  • 35:58It's embarrassing.
  • 36:05So what I love yeah.
  • 36:07Turn this off. What I
  • 36:08love about this is you
  • 36:09can actually see what the
  • 36:10model itself is thinking. So
  • 36:11the model, again, similar to
  • 36:12me, the fact that it
  • 36:13started just one day ago means something. Like, that's acute.
  • 36:17So it's changing its thinking
  • 36:19also. Let's see. Did it
  • 36:20pick up on the lupus
  • 36:21here? Yes.
  • 36:24Ah, now, this is
  • 36:25why I like seeing its
  • 36:26thinking because you can see
  • 36:27it's like, okay. A clot
  • 36:29in lupus, could it be
  • 36:30antiphospholipid antibody syndrome? So still,
  • 36:33like you, it thinks that
  • 36:34a recurrent or worsening PE
  • 36:35is still the number one
  • 36:36thing on the differential. It's
  • 36:37now considering antiphospholipid
  • 36:39antibody syndrome on the differential,
  • 36:41which I think is reasonable
  • 36:42given that additional history of
  • 36:44lupus.
  • 36:45Pericardial effusion. So what you were mentioning, could this be a pericardial effusion? It's actually worried
  • 36:50about tamponade. I don't know
  • 36:51why because the blood pressure
  • 36:52is normal, but it mentions
  • 36:53that. Blood pressure is normal.
  • 36:55So, no hypotension.
  • 36:57Arrhythmia, pericarditis,
  • 36:59still considering ACS and pneumothorax.
  • 37:02It does put panic attack.
  • 37:03And then what it wants,
  • 37:04I don't think it's changing
  • 37:05what it wants. It wants
  • 37:06an EKG, a CTPA, a
  • 37:07TTE, chest x-ray, all the
  • 37:09standard things.
  • 37:10Does this change like, did
  • 37:12this
  • 37:13second opinion change your thinking
  • 37:14at all?
  • 37:16Really? Does it make you
  • 37:17feel more confident or not?
  • 37:20Yes.
  • 37:22Yeah. It makes me feel
  • 37:23yeah. I guess it makes
  • 37:24me feel more confident.
  • 37:26It's okay to say nothing.
  • 37:28[Inaudible audience comment.]
  • 37:31Well, that's, I mean, we'll get into
  • 37:33this. Right? So if you
  • 37:34were to give AI generated
  • 37:36second opinions even if very
  • 37:37effective, it might just lead
  • 37:38to overtreatment of everything.
  • 37:40A lot of that, we'll talk about after. Because this is a big
  • 37:44concern about when and in
  • 37:46what situation you should do
  • 37:47this. Okay. Exam isn't gonna
  • 37:49help you any. I'll put
  • 37:50it in the system, but
  • 37:51this is the ED documented
  • 37:52exam. K. They document it
  • 37:54as completely normal. Okay. This is not one of mine; like, literally, this is
  • 37:58a random patient that
  • 38:01I picked out. So I
  • 38:02have no idea if it
  • 38:03was actually normal. I'm just
  • 38:04gonna put that in there
  • 38:05so it knows. And then
  • 38:05we'll move on to the
  • 38:06next piece of information, which
  • 38:07is the imaging. So
  • 38:09the resident
  • 38:10orders actually pretty much everything
  • 38:12that you ask for. They
  • 38:14also order an EKG. It
  • 38:15is a problem in my
  • 38:16data pipeline that I'm not
  • 38:17able to pull in EKGs
  • 38:18yet, so there is no
  • 38:18EKG here. But, X-ray shows
  • 38:20bibasilar opacities compatible with
  • 38:22small bilateral pleural effusions.
  • 38:26CTA: right lower lobar and segmental PE without right heart strain and a small pericardial effusion, and then these bilateral axillary lymph nodes,
  • 38:38as well as this hypodensity
  • 38:40in the liver. TTE is
  • 38:41performed. Big picture, there's no
  • 38:43strain seen on the TTE.
  • 38:45No tamponade.
  • 38:46But so, yeah, those images,
  • 38:47does that change your thinking
  • 38:48at all or pretty much
  • 38:49where you are?
  • 38:51It, I mean, it knocks down pneumonia a little bit. It knocks down, like, it sounds like her cardiopulmonary silhouette is normal, so, like, I'm not worried that she's got a big pleural effusion or anything like that.
  • 39:07Yeah. I think
  • 39:08I'm still
  • 39:09on the PE
  • 39:11train. You're still on the
  • 39:12PE train. Okay.
  • 39:14Let me
  • 39:16try to not freak out
  • 39:17my AI model too much.
  • 39:19Oh, why am I scrolling
  • 39:21the wrong direction? Embarrassing Adam.
  • 39:23But I want an EKG.
  • 39:26I don't have it.
  • 39:28This is Epic. This is a snowflake problem.
  • 39:31The EKGs are stored in another database, so they're actually not very easy to pull in until, and this is the problem, the cardiologist confirms the read; then you can extract it. So because my data pipeline is running live, none of our patients have EKGs.
  • 39:46So this is what we
  • 39:47were talking about. So much of this comes down to, like, understanding where the data comes from and the limitations.
  • 39:51Okay. So let's see
  • 39:54how the AI model has
  • 39:55changed its thinking. So like you, it still thinks recurrent or worsening PE is the number one diagnosis. It still is worried about antiphospholipid antibody syndrome. But sorry.
  • 40:07Can I just say Yeah?
  • 40:07Please. Antiphospholipid syndrome is not causing her acute, like
  • 40:12I know. So, like, it can say that all at once, and it might be APS, but, like, if the APS is causing something. Right. Right. It doesn't, that doesn't help you.
  • 40:26Well, it helps you down
  • 40:27the line, but not It
  • 40:28doesn't help you in the
  • 40:29acute setting. Exactly.
  • 40:31Pericardial effusion, possibly lupus related
  • 40:33infection,
  • 40:34pericarditis,
  • 40:35arrhythmia.
  • 40:36Yeah. These are all pretty much things
  • 40:37that were on your differential.
  • 40:38Right?
  • 40:40None of this changes your
  • 40:41thinking at all. Doesn't. Except
  • 40:42except to get annoyed because
  • 40:44you're like
  • 40:45even if it's antiphospholipid antibody syndrome, it's still a PE.
  • 40:49Okay.
  • 40:51Labs, I don't know that this is gonna help you much, but at admission, and our lactate cutoff is one point six, this is a slightly elevated lactate. Sodium is one thirty one.
  • 41:00These other labs are all
  • 41:02relatively normal. A troponin was
  • 41:04negative. That's a negative
  • 41:05proBNP.
  • 41:06She is anemic. Hemoglobin is
  • 41:08nine one.
  • 41:09Her INR is elevated one
  • 41:11point seven,
  • 41:12with a PTT of thirty
  • 41:13five.
  • 41:14The diff is normal, and then when they repeated the lactic acid after fluids, it was one point five, which is just below the cutoff. So it's normal in our system, and the repeat troponin was negative. I'm going to, well,
  • 41:25did the labs change anything
  • 41:26for you? No. Yeah. I
  • 41:27I wouldn't think they would.
  • 41:29And let's see if they
  • 41:30change anything via the AI model.
  • 41:42Come on. Show me what
  • 41:43you're doing.
  • 41:47You're running a different ChatGPT.
  • 41:50This is very slow compared
  • 41:51to my experience.
  • 41:53This is a new model called o1. You can see what it's doing. It is using an internalized chain-of-thought process; that's what it's doing, it's thinking through different steps. So that's why it's going so slow. In reality, we're running this on a core with a Llama model that's, like, that fast. It's very fast.
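A minimal sketch of what a sequential second-opinion loop like this can look like, assuming the OpenAI Python client and an o1-class model; the touchpoint summaries, model name, and prompt wording are illustrative stand-ins, not the actual pipeline behind the demo.

```python
# Sketch of a sequential AI second opinion: feed the model each new case
# touchpoint and ask it to revise its differential as information accrues.
# Assumes the OpenAI Python client; the model name, prompts, and case
# summaries are illustrative placeholders, not the demo's real pipeline.
from openai import OpenAI

client = OpenAI()

touchpoints = [
    "Triage: woman with SLE and a known PE on apixaban, one day of pleuritic chest pain.",
    "ED exam documented as completely normal; blood pressure normal, no hypotension.",
    "Imaging: CXR with small bilateral effusions; CTA with right lower lobar and "
    "segmental PE, no right heart strain, and a small pericardial effusion.",
    "Labs: slightly elevated lactate (cutoff 1.6) that normalizes after fluids, "
    "sodium 131, hemoglobin 9.1, INR 1.7, negative troponin and proBNP.",
]

history = []
for info in touchpoints:
    history.append({
        "role": "user",
        "content": "New information: " + info
                   + " Update your ranked differential and state what you would order next.",
    })
    response = client.chat.completions.create(model="o1-preview", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # keep the running case context
    print(answer)
```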
  • 42:11Okay.
  • 42:12So I'm gonna guess it's
  • 42:13gonna say the same things,
  • 42:14but let's check it out.
  • 42:16Recurrent PE,
  • 42:18it's not changing. Right?
  • 42:19So it's basically saying the
  • 42:20same thing.
  • 42:23I don't think any of
  • 42:23these are different.
  • 42:25Okay.
  • 42:28So I'll just go over
  • 42:29the last bit because our
  • 42:31final touch point is when
  • 42:31medicine sees the patient, so
  • 42:33when medicine actually admits the
  • 42:34patient. So the, I bolded
  • 42:36the relevant thing. So, the
  • 42:38ED no. Sorry. The medicine
  • 42:40intern sees the patient still
  • 42:41in the ED, and the
  • 42:42patient says,
  • 42:44Oh, this happened to me
  • 42:45before six years ago. I
  • 42:47had a lupus flare with
  • 42:48very similar symptoms,
  • 42:49and the medicine resident finds out that the patient's outpatient rheumatologist
  • 42:54for the last couple weeks
  • 42:55has felt that she's having
  • 42:56a lupus flare and has
  • 42:57been modifying her medications
  • 42:59and that these are the
  • 43:00current medications. So methotrexate,
  • 43:02twenty q seven days, which
  • 43:03has just been increased, plaquenil,
  • 43:05folic acid, and then apixaban,
  • 43:07which is new.
  • 43:08Does any of that
  • 43:11change
  • 43:13your
  • 43:16thinking?
  • 43:18It's okay to say no.
  • 43:20No. It it it doesn't.
  • 43:21It's just you know, I
  • 43:22think there's
  • 43:24two things going
  • 43:27on.
  • 43:30Do you want me to
  • 43:30tell you what the final
  • 43:32Like, I guess my question
  • 43:33is, is it pleuritic?
  • 43:34Is it pleuritic? Yeah. You
  • 43:36wanna know what the final
  • 43:36diagnosis was from the medicine team? Is it a lupus flare? It's a lupus flare. Yeah. So the final diagnosis is actually that this poor woman has pericarditis and pleuritis from
  • 43:46a lupus flare and had
  • 43:47a PE secondary to that.
  • 43:49So it is two different
  • 43:50things going on. And, you
  • 43:52know, the ED team did
  • 43:54everything completely appropriately. Right? When
  • 43:56you have a patient like
  • 43:56this come in, obviously, you
  • 43:57wanna make sure they're not
  • 43:58having something devastating. So it's
  • 43:59not even that this patient
  • 44:00was mismanaged
  • 44:02in any way.
  • 44:03It's just not what the
  • 44:05initial diagnosis was. So I
  • 44:06was more curious.
  • 44:07Come on.
  • 44:09It's very slow. It's being
  • 44:10very finicky. So I'm curious
  • 44:12what the AI model is
  • 44:13going to end up saying.
  • 44:15You can see how slow
  • 44:16it is. This is all
  • 44:17the different steps.
  • 44:20Well, a friction rub?
  • 44:23I did not see this
  • 44:25patient. This is all mediated through the chart, and I will tell you that in our data pipeline, which pulls in only two different documented physical exams, no one documented it; that doesn't mean she didn't have it.
  • 44:40I highly doubt that the ED resident did the exam
  • 44:42later. Okay. So let's see
  • 44:43what the model says.
  • 44:45Okay. So this is, and, this is the right final diagnosis, which is lupus flare with serositis.
  • 44:51And then number two, the patient has a pulmonary embolism. And, of course,
  • 44:55what we found out, and I'm not this patient's doctor, but what ended up happening is they get the outside CTA, and it
  • 45:01shows that, in fact, the
  • 45:03PE is no larger. It's
  • 45:04even a little bit smaller.
  • 45:05So this is not a
  • 45:05recurrent PE. The symptoms were
  • 45:07likely driven from pericarditis and
  • 45:09pleuritis, so lupus serositis.
  • 45:12The patient was worked up for antiphospholipid antibodies, and all the tests were negative. So that's the final
  • 45:17diagnosis. So reflecting back, like,
  • 45:19would this have been helpful
  • 45:20if you were getting this
  • 45:21and would have driven you
  • 45:22in the wrong direction?
  • 45:25I don't know if it
  • 45:26would have driven me in
  • 45:27the wrong direction, but it was confirmatory.
  • 45:31But I mean,
  • 45:33I think
  • 45:35well, two things. One, we're
  • 45:36in an ED setting. So,
  • 45:38you know, I think the
  • 45:39most important thing is ruling
  • 45:41out the things that are
  • 45:41gonna kill her in the
  • 45:42next hour
  • 45:44so, you know
  • 45:46It might be lupus, but you don't want to miss, right, you don't want to miss a PE or a dissection
  • 45:51I guess I don't know
  • 45:51if that's anchoring, but, like, my top things are: I wanna make sure
  • 45:55she's not tamponading. I wanna
  • 45:57make sure she's not having
  • 45:58a massive PE. I wanna
  • 45:59make sure she's not having,
  • 46:00like, a dissection
  • 46:02or she's having, like, a
  • 46:03big heart attack. So, like,
  • 46:05other than that,
  • 46:06it didn't really change anything.
  • 46:08Because you would have done
  • 46:09everything exactly the same. And
  • 46:10you were considering those cannot
  • 46:12miss diagnoses from the very
  • 46:13beginning, obviously.
  • 46:14Yeah. Yeah. I don't think
  • 46:15it would have changed much.
  • 46:17Would it have driven you
  • 46:18in the wrong direction? Right?
  • 46:18Would getting a second opinion
  • 46:20like this have made you
  • 46:21second guess yourself or I
  • 46:23think it would have made
  • 46:24me order more tests.
  • 46:26And in in particular, like,
  • 46:28in the ED, you would
  • 46:28have ordered a bunch of
  • 46:29those tests?
  • 46:31Maybe not in the ED,
  • 46:32but, like, it was I
  • 46:33saw it. It was, like,
  • 46:34get a cardiac MRI.
  • 46:36I hope no one well,
  • 46:37it wouldn't matter. Cardiology wouldn't
  • 46:38do the cardiac MRI, but,
  • 46:39yes, that is not an
  • 46:41appropriate test for this workup.
  • 46:42Yeah. But,
  • 46:43yeah, maybe order more I
  • 46:45I might have ordered more
  • 46:46labs if I'm being honest.
  • 46:47Yeah.
  • 46:48Alright. Well, that's thank you
  • 46:50very much. I'll give you
  • 46:51a hand. I was
  • 46:52very sorry that I made you practice general medicine again. You thought you escaped.
  • 46:59No, that's, oops. Well, that's cool that you have one of these things here. So, I mean,
  • 47:04this is an example
  • 47:06of what it looks like
  • 47:08in practice in a randomly
  • 47:09selected case. And you can
  • 47:10start to already see when
  • 47:11you go through it some
  • 47:12of the challenges of implementing
  • 47:14a system like this. I'm
  • 47:15gonna go over some of
  • 47:16the data, including some of
  • 47:17the new data before, seeing
  • 47:18if anybody has any questions
  • 47:19and before I lose my
  • 47:20voice.
  • 47:22So LLMs encode lots of
  • 47:23knowledge.
  • 47:24I'm sure that everyone saw
  • 47:25that, you know, it can
  • 47:26pass the USMLE.
  • 47:27I don't care about this.
  • 47:29You guys should not care
  • 47:30about this either. It turns
  • 47:31out, and this is actually from some of our interesting ablation studies, that LLMs' performance on exams has less to do with medical knowledge than with the fact that they have learned the semantic structure of multiple-choice questions, meaning that they are effectively good test takers.
  • 47:46Some of my colleagues did
  • 47:47a really cool experiment where
  • 47:48they made up two new
  • 47:49organ systems, and then they
  • 47:50had test writers write up
  • 47:51multiple choice questions with those
  • 47:52fake organ systems. And the
  • 47:54LLMs still did really well
  • 47:55on it because they learned
  • 47:56to understand what a question
  • 47:58looks like and guess the
  • 47:59right answer from that. And
  • 48:00I think everyone here knows
  • 48:01that if you're being honest
  • 48:02with yourself about what multiple
  • 48:03choice is like, you start by excluding a couple of things. We
  • 48:06all know how that works.
  • 48:07So none of that matters.
  • 48:09This empathy thing, I think,
  • 48:10is overplayed also. You should
  • 48:12know that
  • 48:13this is the justification for
  • 48:15having LLMs write portal messages,
  • 48:16that patients find their communication
  • 48:18more empathetic.
  • 48:19The standard, of course, is compared to a very overstretched PCP who's just trying to communicate your CBC results. So these are not empathic in-person communications, but at least in written communications, people do find the LLM to be more empathetic.
  • 48:34For what I care about,
  • 48:36LLMs are able to make diagnoses. On a lot of the benchmarks that, like, the field has accepted, LLMs have long since surpassed humans,
  • 48:46but a lot of these
  • 48:47are relatively artificial because they're
  • 48:48very information dense settings, very
  • 48:50complicated diagnoses,
  • 48:52and when you look at where the diagnostic errors come from, they're not coming from lupus nephritis. They're coming from people misdiagnosing common things.
  • 49:01They have, and this is fascinating, an emergent probabilistic reasoning; there's no reason that semantics, that language, should give you a probabilistic understanding of disease states. But in fact, and this was studied with Dan Morgan, when you compare them to large groups of humans, they have a better sense of the pretest probability of disease and how that changes with subsequent tests. That holds up pretty well. For the post-test probability of disease, after a positive test it's not really any better than humans, but after a negative test, it's a lot better than us.
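As a concrete illustration of the pre-test to post-test updating being compared here, a worked example using the odds form of Bayes' rule; this is a sketch with made-up numbers, not the study's code.

```python
# Worked example of pre-test -> post-test probability via likelihood ratios,
# the probabilistic updating compared between LLMs and clinicians.
# All numbers are illustrative, not taken from the study.

def post_test_probability(pretest_p: float, likelihood_ratio: float) -> float:
    """Apply Bayes' rule in odds form: post-test odds = pre-test odds * LR."""
    pretest_odds = pretest_p / (1.0 - pretest_p)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

pretest = 0.30        # illustrative pre-test probability of the diagnosis
lr_positive = 18.0    # illustrative likelihood ratio for a positive test
lr_negative = 0.05    # illustrative likelihood ratio for a negative test

print(f"After a positive test: {post_test_probability(pretest, lr_positive):.2f}")
print(f"After a negative test: {post_test_probability(pretest, lr_negative):.2f}")
```

The negative-test case is exactly where the comparison above favors the models most.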
  • 49:33They can forecast similarly well. So if you ask it, what do you think the percentage chance of the final diagnosis is? This was done with neurologists, ID doctors, and pediatricians. It outperforms every single individual human and every single group of humans, only being beaten when you take the best groups and put them together.
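One hedged sketch of how forecasts like these can be scored against the final diagnosis, using a Brier score (mean squared error between stated probabilities and 0/1 outcomes; lower is better). The forecasts and outcomes below are invented for illustration, not data from the study.

```python
# Scoring probabilistic diagnostic forecasts with a Brier score (lower is better).
# Forecasts and outcomes are invented for illustration.

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

llm_forecasts = [0.80, 0.10, 0.60, 0.25]  # model's stated chance of the final diagnosis
md_forecasts = [0.60, 0.30, 0.50, 0.40]   # one clinician's estimates on the same cases
outcomes = [1, 0, 1, 0]                   # whether that diagnosis was confirmed

print("LLM Brier score:", brier_score(llm_forecasts, outcomes))
print("MD Brier score:", brier_score(md_forecasts, outcomes))
```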
  • 49:52They are able to display
  • 49:54reasoning.
  • 49:55So when it comes to,
  • 49:56like, how will you communicate
  • 49:58with a human,
  • 49:59there's this whole question of
  • 50:00human computer interaction. What would
  • 50:02an AI
  • 50:06second opinion look like? When you give LLMs actual cases, as new information comes in, and ask them to update their thinking, and you compare that to humans, they outperform humans consistently.
  • 50:16It outperforms attendings, who outperform residents, and there's no difference in efficiency, accuracy, quality, or cannot-miss diagnoses.
  • 50:22It does hallucinate more. So this is actually a pretty high hallucination rate. Right? It makes up stuff twelve percent of the time, compared to only three percent of the time with humans. Now the hallucinations
  • 50:31are relatively minor. Some of
  • 50:32them are kind of funny.
  • 50:33One of them was a
  • 50:34patient who had diverticulitis, and the LLM wanted the human to keep gastroenteritis in mind because the patient had recently traveled to Texas, and going to Texas was a risk factor for enterotoxigenic E. coli, which I'm pretty
  • 50:47sure is not true. So
  • 50:48that is a hallucination,
  • 50:49but probably not one that
  • 50:51would harm the patient.
  • 50:52This study was done by
  • 50:53my colleagues at Google,
  • 50:55very controversial when it came
  • 50:55very controversial when it came out. What they did, and it's actually not a very high-performing model, it's a PaLM 2 model, but the model itself could solve CPCs. You can see humans
  • 51:04could solve clinical pathological conferences,
  • 51:06and they randomized humans to
  • 51:09either solve conferences themselves
  • 51:09either solve conferences themselves using Google search, solve them using the AI model, or let the AI model work by itself. And this was
  • 51:16because when you gave humans
  • 51:17the AI model, it actually
  • 51:18made the model not perform
  • 51:19as well. So adding humans
  • 51:21into the mix lowered performance.
  • 51:23This is against
  • 51:24the kind of standard precepts
  • 51:26of the informatics field, so
  • 51:27quite controversial when it came
  • 51:28out. Unfortunately, my group ran a large randomized controlled trial looking at very nuanced measures of reasoning in real cases, so not CPCs, and we found the same thing. The human performance is on the right by itself in blue.
  • 51:43Humans using the AI are
  • 51:44in green, and the AI
  • 51:45model by itself is in
  • 51:46red. So we found the
  • 51:47same thing, and because we
  • 51:49did a very nuanced measure,
  • 51:50I can tell you why.
  • 51:51And it's that humans, when the AI model tells them that they're wrong, disregard those pieces of information. In particular, humans don't like an AI model critiquing or disconfirming the things that they think.
  • 52:06Another randomized controlled trial that was just accepted, this is the one I just heard back on from Nature. So this is in management decisions. Management decisions are notoriously tricky to measure.
  • 52:15In this one, LLMs did improve people's performance when they used them to make management decisions, when randomized.
  • 52:21But when we look at
  • 52:21the subgroup, it's not what
  • 52:23you would think. Like, people
  • 52:24aren't using the LLM to
  • 52:25say, what is the right
  • 52:26dose of apixaban, or even
  • 52:27should I give apixaban? What
  • 52:28the LLM did was cue
  • 52:30them to, for example, apologize
  • 52:32after making a medical error
  • 52:33or communicate better with other
  • 52:35providers or take patient factors
  • 52:37into account when following a
  • 52:38likely cancerous nodule. So it
  • 52:40actually improved performance
  • 52:41not in, like, what we
  • 52:42think of as the standard
  • 52:43management
  • 52:45domains, but in things that
  • 52:46we think humans are good at.
  • 52:49A lot of the work
  • 52:49that I'm doing with Google
  • 52:50is on building models that
  • 52:51can collect data. This is
  • 52:53from the AMIE system. This is standardized patients, not real patients, but a true Turing test where standardized patients talk to a terminal, and they don't know whether it's a human or an AI on the other side. And on twenty-six of twenty-six patient domains, the patients preferred the AI, and on twenty-eight of thirty-two axes from the physician graders, the AI was preferred, and this held up in every single diagnostic category. So we're running this in clinical trials, in actual patients, now. It's still performing quite well, and they're increasingly able to collect data.
  • 53:27Now
  • 53:28the unpublished data that I'm about to show you is from my grad student; I put it in this presentation this morning because the models continue to improve. If you had asked me six months ago, I would have said we're seeing convergence of model performance, that there were only going to be incremental improvements. But this is for solving CPCs. So this is
  • 53:48one of these benchmarks that
  • 53:49goes back almost sixty years.
  • 53:51And the new models have
  • 53:52surpassed everything that came before,
  • 53:54and you can see humans
  • 53:55are in brown at the
  • 53:55bottom. This is the one
  • 53:56that freaks me out because
  • 53:58this is not an HCI
  • 53:59study, right? This is just
  • 54:00looking at the human baseline,
  • 54:01but I showed you these
  • 54:02are real cases for the
  • 54:03diagnostic and management decisions.
  • 54:05The colors are different, but
  • 54:06these are the old graphs,
  • 54:07and this is the new
  • 54:08model on the left. And
  • 54:09you can see that the
  • 54:10new models are performing far better than any other system, not only much
  • 54:16better than the humans, but
  • 54:17better than the previous AI
  • 54:18systems. So we're continuing to
  • 54:20see performance gains.
  • 54:22Eric Horvitz and the Microsoft group published their Medprompt follow-up on o1 today, and they came to the same conclusion as my paper, which is, like, these things have gotten so good that we need new benchmarks, or clinical trials, because they are outperforming everything that we throw at them.
  • 54:39I can go over these
  • 54:40quickly. In reality, so a
  • 54:42lot of tools are now
  • 54:43being used in clinical practice.
  • 54:45They're actually kind of underperforming
  • 54:46relative to what we were sold.
  • 54:47So if you look at
  • 54:48some of the early performance
  • 54:49of AI,
  • 54:51scribes, which I know you
  • 54:52guys are using here at
  • 54:53Yale and some of the
  • 54:54clinics, some of the early
  • 54:55studies actually suggest there is
  • 54:56no efficiency gain because they
  • 54:58hallucinate, and the doctors have
  • 54:59to go back and check
  • 55:01the models.
  • 55:02And, yeah, people like it,
  • 55:03but it's not really saving
  • 55:05anybody time and, you know,
  • 55:07people always care about money.
  • 55:08It's not saving anybody any
  • 55:09money either.
  • 55:10The same thing is happening
  • 55:12with the
  • 55:13patient portal messaging. One of
  • 55:15the very depressing things from
  • 55:16the JAMA study on this
  • 55:17is that it actually took
  • 55:18more time, seven percent more
  • 55:18more time, seven percent more physician time, when the
  • 55:23because it hallucinates or says
  • 55:25something harmful, and the doctor
  • 55:26has to go back and
  • 55:27edit it. Again, the patients
  • 55:28liked the responses more, but
  • 55:29it took the doctors more
  • 55:30time.
  • 55:31And then what everybody should
  • 55:33know, I don't think I
  • 55:34need to say this, but
  • 55:34LLMs are racist and sexist.
  • 55:36Because they are trained on our language and then fine-tuned by humans, they encode all of the biases that humans have. Now, and I just published a study in JAMA, they do appear to be less racist and sexist than us, but they are still racist and sexist.
  • 55:52So in a world where
  • 55:54we're trying to get past,
  • 55:55like, race based medicine,
  • 55:57especially as LLMs get more
  • 55:58and more powerful, we should
  • 55:59know that they are showing
  • 56:01human biases, not only cognitive biases,
  • 56:03but racial and gender biases,
  • 56:04which is concerning.
  • 56:06And then we talked a
  • 56:08little bit, but the you
  • 56:08know, HCI is actually quite
  • 56:10challenging because
  • 56:12if used inappropriately, this technology
  • 56:14probably will drive overtreatment.
  • 56:16Different people need different opinions
  • 56:19at different times. Like, a
  • 56:20second opinion is not universally
  • 56:21helpful.
  • 56:22Also, HCI is unpredictable. There's
  • 56:24a great study from some
  • 56:24of my colleagues at MIT
  • 56:26that showed that the best
  • 56:27radiologists
  • 56:28actually have their performance lowered
  • 56:30by a high performing AI
  • 56:31because they second guess themselves.
  • 56:33So just because an AI
  • 56:34model works well, even if
  • 56:36it consistently works well in silico, doesn't mean that it's actually going to improve human performance, because, you know, again, we are hairless apes that evolved to be hunter-gatherers, and now
  • 56:48we're trying to do complex
  • 56:49medicine in the twenty first
  • 56:50century. So,
  • 56:52whew, I'm gonna lose my
  • 56:53voice. That is it for
  • 56:54this presentation. So if anybody
  • 56:56has any questions or wants
  • 56:57to talk about pathology, I
  • 56:58am happy to, entertain them.
  • 57:05And thank you very much.
  • 57:08Are the new models based on the performance of the previous models, formed from beta testing? So the new model, o1, this is really interesting, has no new data in it. It is the same data as GPT-4 Turbo. So
  • 57:21the cutoff is like last
  • 57:23year. So what is improving its performance has nothing to do with the training data; what they're doing is chain of thought. So if you get a model to speak its thinking out loud, it does better. And what they've done is reinforcement learning on the chain of thought.
  • 57:34So they're teaching it how
  • 57:35to think out loud and
  • 57:36then reinforcing that over time.
  • 57:38So these models, these are
  • 57:39all computational techniques. It has
  • 57:40nothing to do with the
  • 57:41underlying data, and there's no
  • 57:43more scale. The parameters of
  • 57:44the model are exactly the
  • 57:45same, which is one of
  • 57:46the reasons I'm so freaked
  • 57:47out because I didn't think
  • 57:47we could get such impressive
  • 57:49performance gains without increasing the
  • 57:51number of parameters.
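A minimal sketch of the prompting-level analogue of that idea, assuming the OpenAI Python client: the same question asked directly versus with an instruction to reason step by step. The reinforcement learning over chains of thought happens at training time and is not reproducible here; the model name and prompts are illustrative.

```python
# The chain-of-thought idea at the prompting level: the same question asked
# directly vs. with an instruction to think step by step. o1-class models
# internalize this via reinforcement learning over their chains of thought;
# this sketch shows only the prompting analogue. Model name is illustrative.
from openai import OpenAI

client = OpenAI()
question = "A patient on warfarin has an INR of 7 with no bleeding. What should be done?"

variants = {
    "direct": question,
    "chain of thought": "Think through this step by step, listing each "
                        "consideration before committing to an answer:\n" + question,
}

for label, prompt in variants.items():
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---\n{reply.choices[0].message.content}\n")
```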
  • 57:54Yes.
  • 58:12Yeah. Yeah. So this is
  • 58:13a great question. Like,
  • 58:15what would it look like
  • 58:16in practice?
  • 58:17So
  • 58:18I'm assuming
  • 58:19we're talking Epic here. Right?
  • 58:21So the the reality of
  • 58:23the situation is that Epic
  • 58:24is working on clinical decision
  • 58:26support software.
  • 58:28This is not a huge
  • 58:29priority. If you look at
  • 58:30what Epic is working on,
  • 58:31they're mostly working on efficiency, like text summarization.
  • 58:34However,
  • 58:35Epic does make it fairly
  • 58:37easy to have a data
  • 58:38pipeline to put information in.
  • 58:40So even at my own institution, we have a pipeline through Amazon Web Services where I can push a second opinion into the chart, trivially easily. Like, any health system could
  • 58:46do this. Any third party,
  • 58:47there are vendors right now
  • 58:48who want to sell you
  • 58:49this technology. No one should
  • 58:50buy it, by the way,
  • 58:52because this is not tested,
  • 58:56and I'm pretty certain that
  • 58:57it will lead to worse
  • 58:58care if used, like, routinely
  • 59:01on every single patient. So
  • 59:02from a technological standpoint, you'd
  • 59:04need, like,
  • 59:06fifteen hours of a programmer's
  • 59:07time to build a pipeline
  • 59:08to do this. The question
  • 59:10becomes, like, what are the
  • 59:11other strategies that you're going
  • 59:12to do to make sure
  • 59:13that you're giving a second
  • 59:14opinion to the right person
  • 59:15at the right time? At
  • 59:16the BI, what we're doing through the Home Run Network is looking at
  • 59:20serving second opinions at clinical
  • 59:22decompensation. So at the moment
  • 59:23that a patient's about to
  • 59:24go to the ICU, based
  • 59:26on this logic that the
  • 59:27patient is already really sick,
  • 59:30we do diagnostic timeouts anyway.
  • 59:32So this is just another
  • 59:33part of the diagnostic time
  • 59:34out. But, so a lot
  • 59:35of our work is like
  • 59:36looking at audit logs, trying
  • 59:37to get a sense on
  • 59:38which patients or which providers
  • 59:39need second opinions, and that's
  • 59:40much more computationally intense.
  • 59:43My guess is, like, in
  • 59:44five years, Epic will just
  • 59:45build this into Epic.
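For a sense of what that programmer-time estimate buys, here is a hypothetical sketch of writing a second opinion back into the chart as a FHIR DocumentReference; the endpoint, token, and patient reference are placeholders, and any real deployment would go through the institution's integration and governance processes rather than code like this.

```python
# Hypothetical sketch: posting an AI second opinion into the chart as a FHIR
# DocumentReference. The endpoint, token, and patient reference are all
# placeholders; this illustrates the plumbing, not a tested integration.
import base64
import requests

FHIR_BASE = "https://ehr.example.org/fhir"  # placeholder FHIR endpoint
TOKEN = "replace-with-oauth-token"          # placeholder credential

note_text = "AI second opinion: recurrent PE remains most likely; consider lupus serositis."

document = {
    "resourceType": "DocumentReference",
    "status": "current",
    "type": {"text": "AI-generated diagnostic second opinion"},
    "subject": {"reference": "Patient/example-id"},  # placeholder patient
    "content": [{
        "attachment": {
            "contentType": "text/plain",
            "data": base64.b64encode(note_text.encode("utf-8")).decode("ascii"),
        }
    }],
}

response = requests.post(
    f"{FHIR_BASE}/DocumentReference",
    json=document,
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/fhir+json",
    },
)
response.raise_for_status()
print("Created resource id:", response.json().get("id"))
```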
  • 59:48Yes.
  • 59:50So, I mean, the question is, for, let's say, the current models: are they capable of creating new information? And if not, what does that mean?
  • 01:00:03You are asking the right
  • 01:00:04questions. So okay. What I'm
  • 01:00:05gonna say is controversial.
  • 01:00:08LLMs,
  • 01:00:10large language models, codify human
  • 01:00:11knowledge.
  • 01:00:13There are actually computational tests of their ability to be creative, for things outside of their training set. I do not think there is any reason to think that any large language model will ever be able to be creative outside of its training set. They are effectively
  • 01:00:29locking in human knowledge. Now
  • 01:00:30they can be updated, and
  • 01:00:31they can read things and
  • 01:00:33integrate that new knowledge, but
  • 01:00:34they're still fundamentally limited by
  • 01:00:36what's in their training set.
  • 01:00:37And that gets to, like,
  • 01:00:38well, what are the impacts
  • 01:00:39for medicine? The fact of
  • 01:00:40the matter is when it
  • 01:00:40comes to diagnosis,
  • 01:00:42ninety-eight, ninety-nine times out of a hundred,
  • 01:00:44we're not being creative, but
  • 01:00:45sometimes that's necessary. And what
  • 01:00:47does this do for human
  • 01:00:48creativity? I mean, everyone's seen
  • 01:00:49this. When you work with
  • 01:00:50an LLM, you have it
  • 01:00:51write something. It's very average.
  • 01:00:53It's very milquetoast.
  • 01:00:54It literally is picking out
  • 01:00:56the average of its training
  • 01:00:57set. That's actually one of
  • 01:00:58the reasons it works well
  • 01:00:59in diagnosis, but there's gonna
  • 01:01:00be downstream effects and the
  • 01:01:02lack of creativity is one
  • 01:01:03of them.
  • 01:01:08Well, you mean in science or in, any domain that we touch. I mean, this is a very real concern.
  • 01:01:15Yeah. You're not wrong.
  • 01:01:18I think, is that depressing? I'm sorry. You're maybe looking for a more optimistic answer there.
  • 01:01:24LLMs
  • 01:01:25will, I don't think, ever
  • 01:01:26be capable of creativity in
  • 01:01:28the way that a human
  • 01:01:29is.
  • 01:01:32Oh, I have all the
  • 01:01:32time in the world.
  • 01:01:34That's not true. But
  • 01:01:36What about, specifically, the physical exam? You know, how can we be doing a diagnostic differential based on the exam? And, and this is a bigger question, about the data that we need, or errors. And there's also implications for what we should be teaching our students.
  • 01:01:59A physical exam, in terms
  • 01:02:01of collecting data, is not
  • 01:02:02something that an LLM can
  • 01:02:03do now.
  • 01:02:05Multimodal models, and this is why I'm telling you, pathology, everyone who confidently predicts that pathology is going to be computerized: that technology is a ways away. Multimodal models do not perform very well.
  • 01:02:16And we're maybe five to ten years from multimodal models being able to perform at a human level. So in the interim, an accurate physical exam
  • 01:02:24becomes incredibly important. And you
  • 01:02:26saw that exam that the resident put in there. That's a templated
  • 01:02:29exam.
  • 01:02:31I mean, probably the resident
  • 01:02:32did an appropriate exam for
  • 01:02:34someone who they thought might
  • 01:02:35have a dissection or PE,
  • 01:02:36but she documented just the
  • 01:02:38templated exam and that can
  • 01:02:40throw off a language model.
  • 01:02:41So when it gets to, like, what are the things that humans are uniquely good at in a world where we're working more and more with AI models: good observational skills, and learning how to do those skills and then accurately represent them, is really important.
  • 01:02:55I think we'll call it
  • 01:02:56a day. We'll give Adam a chance to rest his voice. Phew. Yeah.