
Pathology Grand Rounds, November 7, 2024

November 08, 2024

Adam Rodman, MD, MPH, FACP, of Harvard Medical School and Beth Israel Deaconess Medical Center, presents on, "Towards an AI Second Opinion: Clinical Reasoning Large Language Models and How We Might Make Humans a Little Better."


Transcript

  • 00:00Hi, everybody. Welcome to Pathology Grand Rounds.
  • 00:04I'm Rob Homer.
  • 00:06I am exceptionally pleased to introduce our speaker, Adam Rodman, who has a dual degree in medicine and public health from Tulane, trained in internal medicine at Oregon Health and Science University, and was a global health fellow at Beth Israel Deaconess, where he wound up working in Botswana for a couple of years and developing a curriculum for interns there, which is great.
  • 00:31He's currently assistant professor of
  • 00:33medicine at Harvard Medical School
  • 00:34and attending internist at Beth
  • 00:36Israel Deaconess.
  • 00:38So I became familiar with Adam from his well-known podcast, Bedside Rounds, which is phenomenal because it goes into the history of medicine. And I have to say that if you wanna understand, like, every time you look around and ask, why are things like this? It turns out there frequently is a reason, and it's frequently found in history. And it was like, oh, that's why we do this. So it really provides a lot of insight, and I strongly recommend it. Also, I have to say, at dinner last night, I found Adam to be pure enthusiasm. If we could just harness that, we don't need nuclear. We don't need solar. We just need Adam. It's phenomenal.
  • 01:14So he transformed from history to AI. How did that happen? Because there's a history of machine learning in medicine, which goes back a long way. When I decided to go to medical school, when I was in college, one of my classmates said, why are you doing this? Computers are gonna take this over any day now. And that was in nineteen seventy nine. So maybe it's true; it just took a little while.
  • 01:39His research, funded by the Macy Foundation, the Gordon and Betty Moore Foundation, and the National Academy of Medicine, explores the integration of large language models into complex diagnostic challenges, which you're gonna hear about today. He's an associate editor for the New England Journal of Medicine AI, with published work in JAMA Network Open, JAMA Internal Medicine, and the New England Journal of Medicine, among other journals.
  • 01:59And, actually, you know, the rationale for him really being here: he's a pioneer of digital innovation in education and medicine. He codirects the iMED program, which is dedicated to exploring digital education. Among other things, he has a digital education track for future educators to learn about electronic education technology. He currently leads a task force to integrate AI into Harvard Medical School undergraduate medical education. He has received numerous teaching and mentorship awards, including most recently the Herrman Blumgart faculty teaching award.
  • 02:32Join me in welcoming Adam.
  • 02:34I'm really excited to hear
  • 02:35this. Thank you. This is
  • 02:36gonna be really cool.
  • 02:37Hi, everybody. I'm Adam.
  • 02:41Okay. Can you guys hear
  • 02:42me? Oh, good. I'm a
  • 02:43pacer. Before I start
  • 02:45oh, I'm so sorry. If
  • 02:46you're an internal medicine physician,
  • 02:48could you raise your hand?
  • 02:51I had a feeling I was gonna pick on you guys. Actually, at about twenty minutes in, I'm going to demonstrate our actual clinical workflows. If one of the internists would be willing, it's not a competition, but to actually use the AI on one of the real patients from our studies, would you be willing to?
  • 03:11Yeah. Okay.
  • 03:12So hi. My name is
  • 03:13Adam. I'm not a
  • 03:15pathologist,
  • 03:16so I apologize in advance.
  • 03:18I am I don't know.
  • 03:19I don't know who's nerdier,
  • 03:20but in general, internists are
  • 03:21some of the nerdiest physicians
  • 03:22alive.
  • 03:23I got a nod over
  • 03:24there. So
  • 03:26I research how people think.
  • 03:28And in pathology
  • 03:29and when it comes to
  • 03:30machine learning in particular, there's
  • 03:32really been a focus
  • 03:33unfairly, I think, on, like,
  • 03:35classification models. Right? Just what
  • 03:37is this image? Which I
  • 03:38think, well, not I think.
  • 03:40I know underplays the actual
  • 03:41cognitive processes that go into
  • 03:43pathology, which is why everyone
  • 03:45still here has a job
  • 03:46and is in no danger
  • 03:47of losing a job anytime
  • 03:48soon. Now what I'm going
  • 03:50to talk about today is
  • 03:51my research, but I'm gonna
  • 03:53try to make it entertaining
  • 03:54to talk about
  • 03:56language models,
  • 03:57reasoning, multimodal models, and what the data show. I'll just say, you know, I was talking to my PhD student this morning. I have updated this slide with data that has not been published yet, that was run yesterday. So this is literally the latest breaking data here. But
  • 04:13to give you an idea
  • 04:13of where we've come already
  • 04:15and where we're gonna
  • 04:16go. So
  • 04:17I wanna start with this
  • 04:18idea that is relevant to
  • 04:20basically every medical specialty, and
  • 04:21it's what is diagnosis? What
  • 04:23does it mean to make
  • 04:25a diagnosis? This is what
  • 04:27we all do. This is
  • 04:27theoretically what internists do, though
  • 04:29often we don't get a definitive diagnosis. This is what we do in pathology. Basically, all of medicine, well, not all of medicine, but much of it, is focused on this idea of diagnosis. Even when
  • 04:40we're talking downstream about
  • 04:42management, it
  • 04:44usually
  • 04:45relies on the diagnostic process.
  • 04:47But what does it mean
  • 04:48to make a diagnosis? So
  • 04:50there's a great paper. I
  • 04:51always say it was just
  • 04:52published. It was actually published
  • 04:52about fifteen years ago. I'm
  • 04:53just getting old and unstuck
  • 04:55in time, but there's a
  • 04:56great paper that looks at,
  • 04:59like, what do we mean
  • 04:59when we say diagnosis? What
  • 05:00do we mean when we
  • 05:01say clinical reasoning? And unfortunately,
  • 05:04as I'm sure everybody here
  • 05:06knows, there's not a standard
  • 05:08definition and everybody is talking
  • 05:09about something a little bit
  • 05:11different, which means we have
  • 05:12a tendency to talk past
  • 05:13each other. So to give
  • 05:15you a sense on where
  • 05:16the science stands, it's complicated.
  • 05:18Yeah. I'm gonna do some
  • 05:19Simpsons stuff in here. So
  • 05:20I realized that as a
  • 05:21geriatric millennial that that dates
  • 05:23me quite considerably.
  • 05:24But at its most basic, dating back to the ancients, diagnosis is a classification
  • 05:30task. Like, what is nosology?
  • 05:31Nosology is the way that
  • 05:32we categorize
  • 05:33diseases.
  • 05:34And fundamentally, making a diagnosis
  • 05:36means, okay. I have to
  • 05:38come up with a single
  • 05:39disease or multiple diseases in
  • 05:41my differential out of this
  • 05:42large classification
  • 05:44schema. Well, that sounds easy.
  • 05:45I, I think what's funny
  • 05:47is if you go back
  • 05:48to some of the early
  • 05:49literature, like this is from
  • 05:50the fifties on diagnosis, they
  • 05:51literally thought it was this
  • 05:52easy. Everybody here knows it's
  • 05:54much more complicated than that.
  • 05:57So modern researchers into the
  • 05:58nature of diagnosis have focused
  • 06:01more on human psychology.
  • 06:03So, RIP Danny Kahneman,
  • 06:05if anybody has read Thinking
  • 06:07Fast and Slow lately, I'm sorry. I'm gonna very
  • 06:10quickly summarize it. I will
  • 06:11also say I actually recently
  • 06:12reread Thinking Fast and Slow
  • 06:14just, like, two months ago.
  • 06:15I don't recommend
  • 06:16reading it. Danny Kahneman, brilliant
  • 06:18person,
  • 06:18very droll
  • 06:20writer.
  • 06:21Again, RIP Danny Kahneman. So
  • 06:23our understanding of modern diagnosis
  • 06:25is, you know, based on
  • 06:26cognitive psychology.
  • 06:28So this idea
  • 06:30that
  • 06:31most of the process is
  • 06:32this very fast, very contextual
  • 06:35automatic system one process. So
  • 06:37system one, very classically, a great example is you're
  • 06:40driving on the highway going
  • 06:41home. All of a sudden
  • 06:43you stop paying attention. Fifteen
  • 06:44minutes go by. You're still
  • 06:45on the highway. Everything's going
  • 06:47well. What happened? Your brain
  • 06:49went into these very automatic
  • 06:51thought processes, and that's our classic system one. We talk about heuristics or mental shortcuts. Often system one gets maligned, right? We spend
  • 06:59a lot of time talking
  • 07:00about how system one can
  • 07:01lead us wrong. But the
  • 07:02fact of the matter is
  • 07:03system one exists. We evolved
  • 07:05this way for a reason.
  • 07:06It's fast, it's efficient, and
  • 07:07it works pretty well.
  • 07:09System two are the very
  • 07:10slow, very contextual,
  • 07:13very labor intensive, and theoretically
  • 07:16less biased, not in reality,
  • 07:17but theoretically less biased kind
  • 07:19of formal thought processes.
  • 07:21So what does this look
  • 07:22like in process? Don't worry.
  • 07:23I'm not gonna go over
  • 07:24this. This is from Pat Croskerry, who I've talked about a bit over the last day. So this is Pat Croskerry's grand
  • 07:30unified
  • 07:31theory of clinical reasoning. Again,
  • 07:33don't worry about it, but
  • 07:34the gist is
  • 07:35we cross over between system
  • 07:36one and two. What does
  • 07:37this look like in practice?
  • 07:39So I think probably everybody
  • 07:40has read or is aware
  • 07:41of Judith Bowen's very famous
  • 07:43paper from the early two
  • 07:44thousands, which is based on
  • 07:45the work of Bordage about
  • 07:46a decade before. But our
  • 07:48current understanding,
  • 07:50cognitive psychological understanding of the
  • 07:52diagnostic process is something called
  • 07:54script theory. And this is
  • 07:55a knowledge encoding and knowledge
  • 07:57activation theory.
  • 07:59So
  • 08:00script theory, how does it
  • 08:01work? The idea is that
  • 08:02we know things about our
  • 08:03patients. We know things about
  • 08:05pathology from a variety
  • 08:07of sources. Medical school, of
  • 08:09course, reading journal articles, and
  • 08:10more importantly, from seeing patients
  • 08:12and from learning, from our
  • 08:13practice, we get feedback. And
  • 08:15all that knowledge gets encoded
  • 08:17in our brains, but we
  • 08:18are not a library. We
  • 08:19are not the Dewey decimal
  • 08:21system. I do not say,
  • 08:23heart failure with reduced ejection fraction
  • 08:25is coded at e twenty
  • 08:27three dot o five. That's
  • 08:28not the Dewey decimal system.
  • 08:29That's like another classification system
  • 08:31that I forget. I spend
  • 08:32too much time in academic
  • 08:33libraries. But, no, how does
  • 08:34that information get encoded? And
  • 08:36it gets encoded
  • 08:38in these things that in
  • 08:39medical education and the psychology
  • 08:41world we call scripts. Scripts
  • 08:43are a psychological principle actually
  • 08:44from the nineteen seventies. It
  • 08:46comes from this idea of,
  • 08:48there are stereotyped
  • 08:49patterns of behavior. So for
  • 08:50example, you walk into a
  • 08:51restaurant,
  • 08:52you have a script that
  • 08:54you follow, and when you
  • 08:55deviate from that script, it
  • 08:56freaks people out. And if,
  • 08:57for example, you were to
  • 08:58go to your friend's house
  • 08:59for dinner and you try
  • 08:59to follow the restaurant script,
  • 09:00it would be incredibly rude.
  • 09:02So, you know, in the
  • 09:03early nineties,
  • 09:04psychologists who study clinical reasoning
  • 09:06realized we're doing the same
  • 09:07thing with diseases.
  • 09:09So the idea here is
  • 09:10that when we
  • 09:12activate information about a disease,
  • 09:14what we're really doing is
  • 09:15telling a story to ourselves.
  • 09:16And that story has, obviously,
  • 09:18the presentation.
  • 09:19That story has the pathological
  • 09:21diagnosis, what that might look
  • 09:23like. It has other, you
  • 09:25know,
  • 09:26the treatments, all of
  • 09:27that is organized together. And
  • 09:29then more importantly, scripts do
  • 09:30not exist
  • 09:32in, isolation. It's not like
  • 09:34a library. They exist in
  • 09:35these parallel systems of networks
  • 09:38called schema, where almost by
  • 09:40definition, if you activate one
  • 09:42script, you're excluding another. And
  • 09:44in medical education, we spend
  • 09:45a lot of time, you
  • 09:46know, talking to our students
  • 09:48about semantic qualifiers.
  • 09:50You know, is this acute,
  • 09:51subacute, chronic,
  • 09:53polyarticular, monoarticular? And the reason,
  • 09:55the psychological reason,
  • 09:58this is right out of
  • 09:58Bordage, is that fundamentally
  • 10:02those are the ways that
  • 10:03we include or exclude different
  • 10:05diagnoses. This goes back to
  • 10:06a really interesting study by
  • 10:07Arthur Elstein in the late
  • 10:0870s, where he took a bunch of medical students at Harvard,
  • 10:11took a bunch of attendings,
  • 10:12and then had them think
  • 10:13out loud
  • 10:15about what was going on.
  • 10:16And as he expected, as
  • 10:18the dominant theory was, everyone
  • 10:20did the Sherlock Holmes thing.
  • 10:22Everyone was like, okay, this
  • 10:23is my theory. These are
  • 10:24the next tests, the hypothetico-deductive process. But what else
  • 10:28he noted is, okay,
  • 10:30well, the attendings only asked
  • 10:32five questions and they got
  • 10:34the diagnosis right. The medical
  • 10:35students asked thirty questions and
  • 10:37they were far less accurate.
  • 10:39So there's something else going
  • 10:41on than simply following a
  • 10:42hypothetico-deductive process. So what
  • 10:45does this look like in
  • 10:46practice?
  • 10:47So I'm in clinic. A
  • 10:48patient comes up to me.
  • 10:49She says what did she
  • 10:50say? I can't breathe. It
  • 10:51started yesterday.
  • 10:53Everyone has had this experience.
  • 10:54I mean, it's a little
  • 10:55bit different with pathology, but
  • 10:56you look at something and
  • 10:57then automatically your mind starts
  • 10:59to sort that information. What
  • 11:01is happening? You are activating
  • 11:02one of these scripts. In
  • 11:04medicine, we talk about a
  • 11:05problem representation, the instantiation of
  • 11:07a problem representation,
  • 11:09which is in this case,
  • 11:10acute dyspnea.
  • 11:11This is a way of
  • 11:12teaching, but it's activating that
  • 11:13part of my mind. This
  • 11:14idea that if somebody tells
  • 11:15me all of a sudden
  • 11:16I can't breathe versus six
  • 11:18months slowly progressive shortness of
  • 11:19breath fundamentally activates different ways
  • 11:22that I think about information.
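To make the script idea concrete, here is a minimal sketch, with invented scripts and qualifiers (none of this is from the talk), of how semantic qualifiers might activate or exclude illness scripts:

```python
# Toy illness-script activation: each script lists the semantic qualifiers
# it is compatible with; a problem representation activates only the
# scripts whose qualifiers all match. All content is invented for illustration.

SCRIPTS = {
    "pulmonary embolism":  {"onset": "acute",   "symptom": "dyspnea"},
    "COPD exacerbation":   {"onset": "acute",   "symptom": "dyspnea"},
    "pulmonary fibrosis":  {"onset": "chronic", "symptom": "dyspnea"},
}

def activate(problem_representation):
    """Return the scripts whose qualifiers all match the representation."""
    return [
        name for name, qualifiers in SCRIPTS.items()
        if all(problem_representation.get(k) == v for k, v in qualifiers.items())
    ]

# "I can't breathe. It started yesterday." -> the acute-dyspnea representation
print(activate({"onset": "acute", "symptom": "dyspnea"}))
# ['pulmonary embolism', 'COPD exacerbation'] -- the chronic script is excluded
```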
  • 11:23And then
  • 11:24and everyone has had this
  • 11:25experience. Without thinking, you know
  • 11:28what questions to ask next.
  • 11:30Why? Because they're going into
  • 11:32your script for acute dyspnea.
  • 11:33You are going through that
  • 11:34schema, and you're asking follow-up
  • 11:36questions. So I hypothesis test
  • 11:38COPD.
  • 11:39Hypothesis test pneumonia,
  • 11:41and based on that, further
  • 11:42refine my problem representation
  • 11:45until I get something that
  • 11:47is reasonably close to the
  • 11:48diagnosis or I find something
  • 11:51that doesn't make sense and
  • 11:52I cross over to other
  • 11:54types of metacognitive processes. Now,
  • 11:56I said that I'm going
  • 11:57to go over some of
  • 11:58the, like,
  • 11:59different understandings of clinical reasoning
  • 12:01because I will say that
  • 12:02until,
  • 12:03let's say, about twenty ten,
  • 12:05we had basically agreed in
  • 12:07the reasoning field that script
  • 12:09theory was it.
  • 12:10I think that
  • 12:12it doesn't take much to
  • 12:13think about the problems with
  • 12:15script theory and that it's
  • 12:16not a necessary or sufficient
  • 12:18way to think about how
  • 12:19we make decisions. So my
  • 12:20classic example is right now
  • 12:21I've had about a half
  • 12:22gallon of coffee. So let's
  • 12:23say I'm admitting patients, I'm
  • 12:24well rested, well caffeinated, ten
  • 12:26AM.
  • 12:27My mind is gonna work
  • 12:28differently than when I'm working
  • 12:29at two AM. My pager
  • 12:32goes off every five minutes,
  • 12:33and my Epic chat just keeps going red over and over and over again. Our minds, I mean, they just don't work like that.
  • 12:40There are a lot of
  • 12:41ecological
  • 12:42factors that affect our reasoning
  • 12:44process. And ecological psychology is one of the frames now that we think in,
  • 12:49which is that reasoning is
  • 12:50not something that just happens
  • 12:51up here. Reasoning is something
  • 12:53that happens in our bodies,
  • 12:55happens in our environment, and
  • 12:56these environmental factors play a
  • 12:57large part. Again, I don't
  • 12:58think any of this is
  • 12:59controversial.
  • 13:01I will put up this
  • 13:02idea of situated cognition.
  • 13:04So to say that
  • 13:05something is situated is to
  • 13:06say that it's contingent. These
  • 13:07are all psychological terms. But
  • 13:09the idea here is that
  • 13:10there's no such thing as
  • 13:11neutral decisions. They're all shaped
  • 13:13by our prejudices, by our
  • 13:14life experiences.
  • 13:16And because of that, decision making, per situated cognition, is individualized.
  • 13:22I don't find this very
  • 13:23helpful, like, from somebody who
  • 13:24wants to, you know, make
  • 13:25people better. I don't find
  • 13:26this a very helpful psychological
  • 13:27theory, but it is one
  • 13:28that is out there that
  • 13:29you should know about. And
  • 13:30then one of the big
  • 13:32things that I think is
  • 13:32especially relevant in pathology is how I make diagnoses in the real world. Also, you can tell these are AI generated images because a lot of them have, like, weird demon
  • 13:41faces as the model gets
  • 13:43confused and tries to average
  • 13:44Simpsons characters and other things.
  • 13:46But, you know, I might
  • 13:47make a diagnosis, and I'll
  • 13:48pat myself on the back
  • 13:49that maybe I've diagnosed lupus
  • 13:51nephritis and a new case
  • 13:52of lupus. But you know
  • 13:53what? What I'm doing is
  • 13:54I'm looking at a creatinine
  • 13:55trend that was identified by
  • 13:57the patient's PCP and a nephrologist as an outpatient, and I
  • 14:00might be looking at an
  • 14:00ANA or a double stranded
  • 14:02DNA that was sent much
  • 14:03later, and all of this
  • 14:04is mediated through the electronic
  • 14:05health record. So the reality
  • 14:07of diagnosis, it's not collaborative in the sense that we're in the same room working together,
  • 14:13but most
  • 14:14diagnosis, I don't know most,
  • 14:16I don't know the percentage,
  • 14:17but a lot of diagnosis
  • 14:18in the year twenty twenty
  • 14:19four is mediated through the
  • 14:20electronic health record happening with
  • 14:22different people at different places
  • 14:24in time.
  • 14:25Okay. So why do we
  • 14:27care so much about the
  • 14:28cognitive processes of reasoning? Like,
  • 14:31why does anybody give me money to study
  • 14:33this, which I wonder myself
  • 14:35also sometimes?
  • 14:36And for the last, let's
  • 14:37say, twenty five years, the
  • 14:39focus has been on this.
  • 14:40Ever since To Err Is Human came out, there has
  • 14:42been a huge focus on
  • 14:44medical errors and the burden
  • 14:45of medical errors. This is
  • 14:46from David Newman-Toker's most
  • 14:47recent paper. I actually think
  • 14:49that this is what he
  • 14:50pulled out. I think the
  • 14:51more damning thing is that
  • 14:53he says fifteen diseases account for half of serious harms.
  • 14:55Four diseases accounted for thirty
  • 14:57nine percent of harms in
  • 14:58this study, and those diseases
  • 14:59were such exotic things as
  • 15:00stroke and heart attacks. So
  • 15:03doctors make misdiagnoses that cause
  • 15:04harm all the time. We're
  • 15:06not missing lupus nephritis. Well,
  • 15:08we are, but we're also
  • 15:08missing, like, heart attack and
  • 15:10stroke.
  • 15:11This is from Andy Auerbach's
  • 15:13group. Fantastic
  • 15:14study that looked at
  • 15:17very specific cognitive errors because,
  • 15:19you know, one of the
  • 15:19challenges has always been, yeah, there are diagnostic errors. Of course there are. But is that
  • 15:23because of our human brains,
  • 15:25or is that because of
  • 15:26other systems factors? I mean,
  • 15:27we all know that
  • 15:29getting outside records can be
  • 15:31difficult. There's all sorts of
  • 15:32interruptions. So maybe it's just
  • 15:34our fragmented health system that's
  • 15:35doing this. And David Newman-Toker, sorry, Andy Auerbach, what he did is he looked at fourteen different
  • 15:41hospitals across the US, patients
  • 15:43who were admitted to the
  • 15:44hospital, appropriately triaged, meaning that
  • 15:45they didn't decompensate within the
  • 15:47first forty eight hours, and
  • 15:48then either went to the
  • 15:49ICU or died. And in
  • 15:50those patients, a quarter of
  • 15:52them had a diagnostic error,
  • 15:54and a lot of these
  • 15:56were severe. One fifth of
  • 15:57those errors caused severe harm
  • 15:58or death. And when he looked at the reasons, the most
  • 16:02common cause was this, errors
  • 16:04in human cognition by far,
  • 16:05almost half, with the second
  • 16:07biggest being in test interpretation
  • 16:08and ordering, which is also
  • 16:09a human cognitive error. So
  • 16:11systems errors were actually much
  • 16:12less common. Mind you, this
  • 16:13is a very specific situation,
  • 16:15so patients in the hospital
  • 16:16who are decompensating, so I
  • 16:17don't know that that's generalizable.
  • 16:19But in this population, cognitive
  • 16:21errors are common.
  • 16:23So
  • 16:24in my field,
  • 16:25which I guess is medical
  • 16:26education, reasoning has usually been
  • 16:27within the med ed field,
  • 16:29but we've looked historically at
  • 16:31ways that we can make
  • 16:32people better.
  • 16:33Three big categories. So first,
  • 16:36everyone here who is my
  • 16:38age or less,
  • 16:40I'm forty,
  • 16:41has probably
  • 16:42received some education about cognitive
  • 16:44biases. Like, actually, raise your
  • 16:46hand if you've ever been
  • 16:46taught about cognitive biases.
  • 16:49Yeah. I would expect to
  • 16:50see half or more. So
  • 16:52cognitive biases, of course, this
  • 16:53is like your anchoring bias.
  • 16:54This is recency bias.
  • 16:56This has been a major
  • 16:57trend in medical education over
  • 16:58the last fifteen, twenty years.
  • 17:00The problem is that when
  • 17:01we study these things experimentally,
  • 17:03teaching people about cognitive biases
  • 17:05does not make them less likely to have cognitive
  • 17:08biases. I can teach you
  • 17:09all about anchoring bias. You
  • 17:11are still going to anchor. You may
  • 17:14have the metacognition that you're
  • 17:15doing that, but you can't
  • 17:17stop. And why? Like, sometimes
  • 17:19I'm pretty amazed. Like, look
  • 17:20where we are. Look at
  • 17:21this technology. I am like
  • 17:23a relatively hairless ape that
  • 17:26evolved over millions of years on the plains of
  • 17:28Africa. Our brains are not
  • 17:29evolved to work in this
  • 17:31highly technical world. So it
  • 17:33is not surprising that
  • 17:34there are cognitive biases. These
  • 17:35are the shortcuts that we
  • 17:37evolved with.
  • 17:38It's just part of
  • 17:39being human, so that doesn't
  • 17:41work. Number two is education
  • 17:43about debiasing strategies. This is
  • 17:45very popular and kind of
  • 17:47optimistic because it works. So
  • 17:48if you think about deliberate practice, the literature on deliberate practice,
  • 17:52you know, this would be
  • 17:53the Malcolm Gladwell ten thousand
  • 17:55hours,
  • 17:57even though that has been
  • 17:59somewhat debunked.
  • 18:00But, like, those psychological studies,
  • 18:03if you were to spend
  • 18:04five hours every single week
  • 18:06going over
  • 18:07every single pathologic diagnosis you
  • 18:09made, reflecting on what you
  • 18:10could have done better, seeing
  • 18:11how the patients did, you
  • 18:13will get better at the
  • 18:14diagnostic process. This has been
  • 18:16studied in internists.
  • 18:18It works.
  • 18:19No one has five hours.
  • 18:21And if you were to
  • 18:21spend five hours doing that,
  • 18:23your bosses would get mad
  • 18:24at you for not closing
  • 18:25out enough cases or for
  • 18:26me, like, my length
  • 18:27of stay would go up.
  • 18:28Like, our system does
  • 18:30not allow for these debiasing
  • 18:32strategies as effective as they
  • 18:33are. I mean, to this
  • 18:34day, I keep a follow-up
  • 18:35list. I have ten patients
  • 18:36on it because that's all
  • 18:37I can put because I
  • 18:38don't have time. And whenever
  • 18:39I put a new patient
  • 18:40on, I have to take
  • 18:40an old patient off. I
  • 18:41spend maybe thirty minutes a
  • 18:43week. I don't even know
  • 18:44if it makes me that
  • 18:44much better, but at least
  • 18:46I try. And even then,
  • 18:47it's really hard to do
  • 18:49that. So that brings us
  • 18:50to number three,
  • 18:52which, doctor Homer mentioned, AI.
  • 18:54We've been talking about AI
  • 18:55for a long time.
  • 18:57So artificial intelligence has been
  • 18:59the third historical and actually
  • 19:01the oldest way to make
  • 19:02humans better. So this is
  • 19:04where,
  • 19:06doctor Homer was talking about
  • 19:07nineteen seventy nine. I assume
  • 19:09they were talking about INTERNIST-1, probably,
  • 19:11but we've been talking about
  • 19:12this for over a century.
  • 19:14So this is the oldest
  • 19:15quote I've ever found about
  • 19:17artificial intelligence in medicine from
  • 19:18Bernard Shaw. He's reflecting on how, like, no one does math by hand anymore, in old fashioned spreadsheets, on paper, where we had to do all this math.
  • 19:26By the early twentieth century,
  • 19:27there were counting machines that
  • 19:28did all that. And he
  • 19:29says, in the clinics and
  • 19:30hospitals of the near future,
  • 19:31we may quite reasonably expect
  • 19:33that the doctors will delegate
  • 19:34all the preliminary work of
  • 19:35diagnosis to machine operators.
  • 19:38And then he says, the
  • 19:38observation of the symptoms is
  • 19:40extremely fallible, depending not only
  • 19:42on the personal condition of
  • 19:43the doctor who has possibly
  • 19:44been dragged to the case
  • 19:45by his night bell after
  • 19:46an exhausting day. I love
  • 19:47that that also hasn't changed
  • 19:49in over a century. But
  • 19:50upon the replies of the
  • 19:51patient to questions which are
  • 19:52not always properly understood and
  • 19:54for lack of the necessary
  • 19:55verbal skill could not be
  • 19:56properly answered if they were.
  • 19:58From such sources of error,
  • 19:59machinery is free. So when
  • 20:01I talk about artificial intelligence,
  • 20:02we are not talking about
  • 20:03anything new. These ideas have
  • 20:05been around for a long
  • 20:06time. And depending on how
  • 20:08you define a diagnostic AI,
  • 20:11the first artificial intelligence in
  • 20:12medicine was made about six
  • 20:14months after the first electronic
  • 20:15computer was made. So we've
  • 20:17been working on this for
  • 20:18a long time.
  • 20:19Historically,
  • 20:20AI clinical decision support
  • 20:22has had really big impacts,
  • 20:24though, in limited domains. One
  • 20:26of my favorite examples is
  • 20:27AAPHelp. This is sometimes called the Leeds abdominal pain system; it's for patients presenting
  • 20:31with an acute abdomen.
  • 20:33A multicenter RCT showed a
  • 20:35mortality
  • 20:36decrease in acute abdomens of
  • 20:38twenty two percent.
  • 20:39That is huge. It's a
  • 20:41multicenter RCT. It was so
  • 20:43important that the US Navy
  • 20:45considered it an essential part
  • 20:46of our nuclear shield because
  • 20:47they put it on all
  • 20:48the, the submarines with ICBMs. Because if a seaman had an acute abdomen, you needed to know whether to surface that submarine and evacuate that seaman, in which case part of your nuclear umbrella went away. And then, of
  • 20:57course, there's INTERNIST-1. I don't know if anyone here knew Jack Myers, but INTERNIST-1 is one of the
  • 21:02coolest AI systems of all
  • 21:03time. It was modeled on
  • 21:04the brain of Jack Myers,
  • 21:06sometimes called, by the people who knew him, Black Jack.
  • 21:09He is the reason we
  • 21:10no longer have oral boards
  • 21:11in medicine. He had an eidetic memory and was chair of medicine at Pitt. But an AI system
  • 21:15that was based on his
  • 21:15brain in the year nineteen
  • 21:16eighty two could solve the
  • 21:18New England Journal of Medicine
  • 21:19clinical pathological conference better than
  • 21:21any human could. So not
  • 21:23new technologies.
  • 21:24Now
  • 21:25one of the sad things
  • 21:26is that the technology really
  • 21:27stagnated. It worked really well.
  • 21:30A lot of these computational
  • 21:31strategies worked really well in
  • 21:32narrow
  • 21:33categories,
  • 21:34but from roughly nineteen ninety
  • 21:36to the year twenty twelve,
  • 21:38there wasn't really any change
  • 21:40in the performance. There's actually
  • 21:41a review from twenty twelve
  • 21:42that shows that differential generators
  • 21:44in twenty twelve worked exactly
  • 21:45as well as those that
  • 21:46had come twenty years before.
  • 21:47So real stagnation of the
  • 21:49computational techniques.
  • 21:50Now
  • 21:52language models are the most
  • 21:54exciting technology in the clinical
  • 21:56reasoning space to come out
  • 21:57in
  • 21:59forty years, basically since the
  • 22:00early nineteen eighties. I'm
  • 22:02going to briefly explain
  • 22:04what they are, how they
  • 22:05work, and why they might
  • 22:06be exciting before actually demonstrating,
  • 22:09like, how it works on
  • 22:10a real case. So a
  • 22:12language model. I should say, who
  • 22:15here has not used a
  • 22:16language model before?
  • 22:19Oh. A couple people. So
  • 22:20most everyone, I imagine, has
  • 22:21at least put something into
  • 22:23ChatGPT before?
  • 22:24Yeah. I think they're pretty ubiquitous. I believe
  • 22:27ChatGPT was the most downloaded
  • 22:29app in history. So, presumably,
  • 22:30most people have. So how does a language model work? It's a type of neural
  • 22:35network. It's actually a transformer.
  • 22:37But, effectively,
  • 22:38it's autopredict on steroids. So
  • 22:41let's say I'm sure everyone
  • 22:43has done this. You go
  • 22:43to Google. You type in
  • 22:44pathologists are, and it predicts
  • 22:46a bunch of different words.
  • 22:47If I type in internal
  • 22:48medicine physicians are, it'll probably
  • 22:49say, like, nerdy or awkward
  • 22:51or maybe it'll say smart,
  • 22:53but it has a lot
  • 22:53of predictions on what the
  • 22:54next word is based on
  • 22:56the tens of thousands of
  • 22:57searches that have come before.
  • 22:58A language model takes that
  • 22:59fundamental technology
  • 23:01and puts it on steroids.
  • 23:03So it takes, not all of human text, but virtually all
  • 23:07of human text. We don't
  • 23:08know what's in the training
  • 23:08corpus of the latest foundation
  • 23:10models, but we do know
  • 23:11for GPT three class models,
  • 23:12which is all of the
  • 23:13Internet,
  • 23:14a lot of pirated books,
  • 23:16a lot of human textual
  • 23:17material. We also know from
  • 23:18news reports that these companies
  • 23:19are doing things like scraping
  • 23:21YouTube videos, scraping podcasts. They're
  • 23:23hungry for textual data. So
  • 23:26this is the script, by
  • 23:26the way. Huge nerd here,
  • 23:27internal medicine physician. Shocker. This
  • 23:29is a script for Star Trek II: The Wrath of Khan, which is because I'm
  • 23:32a Trekkie, but it's also
  • 23:33because GPT two had a
  • 23:34very famous experiment where you
  • 23:36trained it only on the
  • 23:37scripts from Star Trek because,
  • 23:38again, the people who built
  • 23:39these models were nerds, and
  • 23:41someone actually built a dataset
  • 23:42only on scripts of Star
  • 23:43Trek, which I love.
  • 23:44So
  • 23:45it breaks every single word, every single piece of
  • 23:50text into units called tokens.
  • 23:52And a token is a
  • 23:53piece of a word. It
  • 23:55is a basic semantic unit
  • 23:57of a language model.
  • 23:58And just like in the
  • 23:59example where you type in
  • 24:00pathologists are, and it predicts
  • 24:02the next word, it predicts
  • 24:03the next token based on
  • 24:04this huge training set of
  • 24:06information.
  • 24:07What makes it a large
  • 24:08language model is that it's
  • 24:10based on a technology called
  • 24:11a transformer, which doesn't just
  • 24:13predict the next token,
  • 24:14but it predicts the
  • 24:15next token in a vector
  • 24:16string of tokens, which means,
  • 24:18semantically, it's predicting the next
  • 24:20word in the context of
  • 24:21a sentence, in the context
  • 24:22of a paragraph, in the
  • 24:23context of an entire book.
  • 24:25And because these are so
  • 24:26large and computationally intensive, it's
  • 24:29doing these calculations. You know, you go to Google, you type it once, that's one calculation. It's doing that across four hundred billion parameters.
  • 24:36So massive amounts of calculations
  • 24:38to figure out what the
  • 24:40next token in a string
  • 24:41should be.
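As a very scaled-down sketch of that "autopredict on steroids" idea, here is a toy next-word predictor built from bigram counts over a tiny invented corpus; a real LLM replaces the count table with a transformer conditioned on the whole preceding context, but the output, a probability distribution over the next token, is the same kind of object:

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for "virtually all of human text."
corpus = ("internal medicine physicians are nerdy . "
          "pathologists are nerdy . pathologists are smart .")

# Trivial "tokenizer": split on whitespace. Real models use subword tokens.
tokens = corpus.split()

# Count how often each token follows each token (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def next_token_distribution(prev):
    """Probability distribution over the next token, given the previous one."""
    counts = following[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

print(next_token_distribution("are"))
# {'nerdy': 0.666..., 'smart': 0.333...}
```

The transformer's contribution is conditioning that prediction on the entire token sequence rather than just the previous word, which is what lets it track context across a sentence, a paragraph, or a whole document.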
  • 24:42And now
  • 24:44I, as somebody well, I'm
  • 24:45losing my voice. I guess
  • 24:46I'm not surprised. I've been
  • 24:47talking nonstop for the last
  • 24:48twenty four hours. As someone
  • 24:49who studies human reasoning,
  • 24:51humans are both, like, wonderful
  • 24:52creatures, and we're terrible sometimes. Right? So
  • 24:57we wrote the declaration of
  • 24:58the rights of man. We
  • 24:59wrote the Torah. We
  • 25:00wrote the Bhagavad Gita. We
  • 25:02also wrote Mein Kampf and
  • 25:03the website 4chan, and
  • 25:04all of those are encoded
  • 25:06in language models. So the
  • 25:08final step that makes these models: there is a pretraining step that I won't go into,
  • 25:11but the final step that
  • 25:12makes them so eerily human
  • 25:14is
  • 25:14reinforcement learning through human feedback.
  • 25:16Sometimes it's called fine tuning.
  • 25:18There's many ways to fine
  • 25:19tune, though. But the idea
  • 25:20here is that a human
  • 25:21being actually sits down and
  • 25:22talks to a language model and
  • 25:24says, oh, it's a Skinner
  • 25:25box. Right? You did good
  • 25:26language model, you get a
  • 25:27cookie, or you did bad,
  • 25:28you get an electric shock.
  • 25:29It's a one or a
  • 25:30zero, but same idea.
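As a caricature of that Skinner-box loop, here is a minimal sketch; real RLHF trains a separate reward model and updates the network's weights with policy-gradient methods, but the cookie-or-shock feedback signal works the same way (the completions and the rater here are invented):

```python
import math
import random

# Two candidate completions, each with a learnable preference score.
scores = {"Let's delve into that.": 0.0, "asdf qwerty": 0.0}

def sample():
    """Sample a completion with probability proportional to exp(score)."""
    weights = [math.exp(s) for s in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

def human_feedback(completion):
    """Stand-in for a human rater: 1 is a cookie, 0 is the electric shock."""
    return 1 if "delve" in completion else 0

for _ in range(500):  # many tiny training cycles
    completion = sample()
    reward = human_feedback(completion)
    scores[completion] += 0.1 * (reward - 0.5)  # reinforce or punish

print(scores)  # the "delve" completion ends up strongly preferred
```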
  • 25:31And then over time, through
  • 25:33all these training cycles, it
  • 25:34makes them remarkably human. So
  • 25:36a fun fact about the
  • 25:37word delve, everybody knows that
  • 25:38ChatGPT loves to say, let's
  • 25:39delve into that. Right? And
  • 25:41there was a great study
  • 25:41that was just published, well,
  • 25:43about a year ago that
  • 25:44showed that in the scientific
  • 25:45literature, the word delve was
  • 25:46almost never used, and now
  • 25:47it's used incredibly often because everyone's using ChatGPT to help write
  • 25:50their papers. Well, why is
  • 25:51delve in there? It turns
  • 25:53out that OpenAI
  • 25:54used contract workers in Nigeria
  • 25:56and Kenya to do most
  • 25:57of their RLHF.
  • 25:58And delve sounds weird in American English, but in Kenyan and Nigerian English, delve is commonly used. It doesn't sound weird to them. So now delve is a big
  • 26:07part of American scientific literature
  • 26:10because of the RLHF of
  • 26:12a language model by contract
  • 26:13workers in Kenya and Nigeria.
  • 26:15So just one of those
  • 26:16fun things. And, of course,
  • 26:18what you get is something
  • 26:19that is remarkably human. You
  • 26:20know, I tell ChatGPT
  • 26:22I'm doing community theater. I
  • 26:23wanna do the wrath of
  • 26:24Khan. And it says, farewell,
  • 26:25noble admiral. Hold not your
  • 26:26breath. The enterprise cannot save.
  • 26:28She's marked for death. From
  • 26:29the sky, soon shall be
  • 26:30torn asunder, a fiery end,
  • 26:31a final echoing thunder. In Shakespeare, right? So he still goes, KHAAAN!
  • 26:36So I did just do
  • 26:37this so I could do
  • 26:37this meme. You're welcome. But
  • 26:39to also point out that
  • 26:41the technology that is underneath
  • 26:43this, a transformer,
  • 26:44can be used to produce any kind of human creative output. This is just DALL-E 3. It's Captain Kirk at the Globe Theatre. DALL-E 3 is what's called
  • 26:52a diffusion model. There are
  • 26:54now video models. If anyone
  • 26:55has seen Sora, that's the
  • 26:56OpenAI video model. There's this
  • 26:58crazy thing. Like, there are
  • 26:59video games that are diffusion
  • 27:00video games, meaning that you,
  • 27:02like, walk through a video
  • 27:03game, and it renders
  • 27:04every single scene from an
  • 27:06AI, like, hallucination. And then,
  • 27:08of course, most concerningly, you
  • 27:09can clone people's voices really
  • 27:11effectively now, about ninety seconds
  • 27:13of audio, and you can
  • 27:14sound like anybody. And as
  • 27:15someone who has, like, thirty
  • 27:17hours of podcasting out, someone
  • 27:18could easily clone me. And
  • 27:19so if I call you
  • 27:20up, don't trust that it's
  • 27:21me, especially if I'm trying
  • 27:22to get your Social Security
  • 27:23number.
  • 27:24Okay. Why does this matter
  • 27:25for diagnosis?
  • 27:27Because that's cool, right? I've told a couple people this: I got into all of this research because I
  • 27:33was working on my second
  • 27:34book, which was about clinical
  • 27:35reasoning. I talked to a bunch of data scientists, and I had used GPT three before, like, before ChatGPT was released. And
  • 27:43I thought it was stupid,
  • 27:44and it wasn't gonna accomplish
  • 27:45anything, and I was not
  • 27:46impressed at all. So
  • 27:47with that caveat, maybe you
  • 27:49shouldn't listen to anything that
  • 27:50I say. But, like, why
  • 27:51do language models appear to
  • 27:53work in diagnosis? So this
  • 27:54is a real case of
  • 27:55mine. This is a patient
  • 27:56in whom I made a
  • 27:57diagnostic error.
  • 27:58The patient died.
  • 28:00And what I will say
  • 28:01is that the patient was
  • 28:03like, I am friends with
  • 28:04his wife. I have permission
  • 28:05from his family. He was
  • 28:06also a huge trekkie. So
  • 28:07this patient had night sweats,
  • 28:10monocytosis,
  • 28:11a daily fever, ground glass
  • 28:12opacities on the X-ray. He
  • 28:13had been treated for bladder
  • 28:14cancer, like, four or five
  • 28:15years before, BCG.
  • 28:18He did not improve with
  • 28:19antibiotics, which is when I
  • 28:20met him in the hospital.
  • 28:21He still had the fevers
  • 28:22despite high dose antibiotics.
  • 28:23Liver enzymes were crazy. And
  • 28:24then he had spinal hardware
  • 28:26in his back from previous
  • 28:27surgeries. And, you know, about two weeks before GPT-4 was released,
  • 28:32I got back from the
  • 28:33state lab what he had
  • 28:34actually died from. So this
  • 28:35poor man, we thought he
  • 28:35had culture negative endocarditis.
  • 28:37He died from a freak
  • 28:38case of M. bovis bacteremia, presumably reactivated from his BCG treatment. Quite rare,
  • 28:45but misdiagnosis.
  • 28:46It wasn't just me. Like,
  • 28:47I was the hospitalist
  • 28:48with the residents. Like, there
  • 28:49were a lot of doctors
  • 28:50involved. But, you know, one
  • 28:51of the first things that
  • 28:51I did when GPT four
  • 28:53came around is I
  • 28:54asked for a second opinion
  • 28:56based on my problem representation
  • 28:57at the time. And what
  • 28:59happens is that it says,
  • 29:01you know, the the number
  • 29:02one diagnosis was what the
  • 29:03patient died from, and the
  • 29:04number two diagnosis was what
  • 29:06I was wrong about.
  • 29:08So, you know,
  • 29:09this is me in twenty
  • 29:11twelve, and my first thought
  • 29:12is what if I had
  • 29:13had this technology six months
  • 29:14ago?
  • 29:15Would this have changed anything?
  • 29:17And we actually don't know
  • 29:18the answer because second opinions,
  • 29:20there are not great
  • 29:21studies of second opinions, at
  • 29:23least in internal medicine. We
  • 29:24do know
  • 29:25that there are discrepancies, and they can be quite large; these are
  • 29:28pathological diagnoses, so these are
  • 29:30actually pathology studies.
  • 29:32There's only one good prospective
  • 29:33study on this from the
  • 29:34Netherlands, which did find, like,
  • 29:37frequent switching of diagnoses
  • 29:39and improved symptoms. The patients
  • 29:41actually did better when they
  • 29:42got a second opinion.
  • 29:43But, you know, the data
  • 29:45is pretty early. So how
  • 29:47do language models presumably do
  • 29:49a good job at diagnosis?
  • 29:51Well,
  • 29:52we've done some cool, like, ablation studies. My job is really cool. I basically get to do psychology
  • 29:57on both humans and machines,
  • 29:58but we can do ablation
  • 29:59studies where we knock out
  • 30:00parts of the language model
  • 30:01to figure out what's going
  • 30:02on. And what appears to happen is that
  • 30:05the reason that language models
  • 30:06can make diagnoses is because
  • 30:07of their similarity
  • 30:09to how human brains make
  • 30:10diagnoses. Right? So if you
  • 30:11think about token prediction, you
  • 30:13think about the log probs
  • 30:14of different words,
  • 30:16that's what a script is.
  • 30:17So LLMs are basically
  • 30:20system one on steroids, and
  • 30:22they encode far more knowledge
  • 30:24than we do.
  • 30:25They are not perfect, but
  • 30:27it does appear that this
  • 30:28is what gives them their
  • 30:29remarkable abilities in diagnosis.
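You can see that system-one analogy directly, because most model APIs expose the log probabilities. A minimal sketch using the OpenAI Python client (the model name and prompt are placeholders, and logprobs support varies by model):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask for a single next token after a diagnostic prompt, along with the
# log probabilities of the top competing tokens.
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use any chat model that supports logprobs
    messages=[{
        "role": "user",
        "content": "One word: the most likely diagnosis for acute dyspnea, "
                   "pleuritic chest pain, and tachycardia is",
    }],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# Each alternative token carries its log probability -- loosely, the
# "activation strength" of the competing scripts.
for alt in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{alt.token!r}: {alt.logprob:.2f}")
```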
  • 30:31So with that, wait, who is my volunteer
  • 30:34internist for this?
  • 30:36Yeah.
  • 30:38That's it. You work on
  • 30:39Yelp?
  • 30:40So I am going to
  • 30:42show an example of
  • 30:44how this technology works.
  • 30:46Oh, jeez.
  • 30:47So no. No. Five years.
  • 30:48It's I'm sorry. No. This
  • 30:51is, so I'll just explain
  • 30:53what we're doing. So right
  • 30:54now, if you go to
  • 30:54the BI and you get
  • 30:56admitted to the emergency room
  • 30:57and you meet certain inclusion
  • 30:58or exclusion criteria, you will
  • 30:59be pulled into our data
  • 31:00pipeline
  • 31:01to study the effect of second opinions at different, what we call, diagnostic touch points.
  • 31:09Because one of the ideas
  • 31:10is that when you evaluate these things, it depends on the information
  • 31:14density. Right? Real diagnosis, you
  • 31:16don't get a case vignette.
  • 31:17You're often operating in
  • 31:19poor information settings. So what
  • 31:21we are doing here is
  • 31:22I am going to, walk
  • 31:24you through this is all
  • 31:25real material. I've stripped it
  • 31:26of PHI. This is a
  • 31:27real patient.
  • 31:28Just so you know, I intentionally just chose a random patient. This
  • 31:32is not a zebra. Okay.
  • 31:34This is not like a
  • 31:34CPC. So what I want
  • 31:36you to do is this
  • 31:37is the information that was
  • 31:39available to the poor emergency
  • 31:40room resident, and I want
  • 31:41you to tell me what
  • 31:42your thinking is, what you
  • 31:44would wanna do, and then
  • 31:45I'm going to ask the
  • 31:46model, and you can tell
  • 31:47me how that changes your
  • 31:48thinking.
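The demo that follows boils down to a loop like this sketch (the aliquots, the prompt, and the model name are stand-ins, not the lab's actual pipeline code): at each diagnostic touch point, only the new chart material is appended, and the model regenerates its differential with the accumulated context:

```python
from openai import OpenAI

client = OpenAI()

# Chart "aliquots" in the order a clinician would actually see them.
# Placeholder text; the real pipeline pulls these from the EHR, stripped of PHI.
touch_points = [
    ("ED triage note", "Chest pain, tachycardia. New PE and DVT dx 5 days ago ..."),
    ("ED resident note", "History of lupus, worsening dyspnea starting today ..."),
    ("Imaging", "CXR: bibasilar opacities. CTA: RLL segmental PE ..."),
    ("Labs", "Lactate 1.7, Na 131, Hgb 9.1, INR 1.7, troponin negative ..."),
]

history = [{"role": "system",
            "content": "You are a diagnostic second-opinion tool. After each "
                       "new piece of chart data, give an updated ranked "
                       "differential diagnosis and suggested next steps."}]

for name, text in touch_points:
    history.append({"role": "user", "content": f"{name}:\n{text}"})
    resp = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(f"--- differential after {name} ---\n{answer}\n")
```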
  • 31:49Okay. So I'll read
  • 31:50it out loud if you
  • 31:51want even though I'm losing
  • 31:52my voice. So this is
  • 31:53a a young woman who
  • 31:54walks into our emergency room.
  • 31:56This is the ED triage
  • 31:57note. So this is someone
  • 31:58this history is taken not
  • 31:59in the emergency room, but
  • 32:00in the waiting room, and
  • 32:01a nurse has taken this.
  • 32:03So,
  • 32:04chief complaint, they have to
  • 32:05put in an ICD code.
  • 32:07So this is chest pain,
  • 32:08tachycardia.
  • 32:09The triage history, patient reports
  • 32:10a new PE diagnosis and
  • 32:11left lower extremity DVT at an outside hospital five days ago. Put on Eliquis ten BID since four days ago. Patient now arrives here with worsening chest pain, cough,
  • 32:19and tachycardia.
  • 32:20The nurse appropriately put this
  • 32:22to the highest severity level,
  • 32:24one, which means that the
  • 32:25doc she goes back immediately
  • 32:26and the doctor sees her.
  • 32:27And vitals, fever, one zero
  • 32:29one. Heart rate, one forty.
  • 32:30Respiratory rate, twenty six. Blood
  • 32:32pressure and o two sats
  • 32:33are fine.
  • 32:35K.
  • 32:38So you want me to
  • 32:39Yeah. Just say what you
  • 32:39would what you would do.
  • 32:42Okay. So, let's get a chest x-ray. Let's get an EKG.
  • 32:56I'm worried she is a
  • 32:58I'm worried she still has
  • 32:59a PE. Yeah. I'm worried
  • 33:00she has either,
  • 33:02I'm worried she has a
  • 33:03new infection now for some
  • 33:04reason, and I'm worried she
  • 33:05has, like, a,
  • 33:07like, a dissection possibly. Correct.
  • 33:10So your top worries are new or worsening PE, infection,
  • 33:15something like an aortic dissection.
  • 33:20Me and Macs don't get along.
  • 33:23Okay.
  • 33:25So one of the questions
  • 33:26that I always get when
  • 33:27people come to my lab
  • 33:28is they're like, can I
  • 33:28see the AI? Because they
  • 33:29think there's something really cool
  • 33:30when you see an AI.
  • 33:31The reality is that it's
  • 33:32an Excel spreadsheet.
  • 33:34It's a JSON database. So
  • 33:36it's not exciting. So what
  • 33:37I'm doing here, this is the model I showed you earlier. This is not the model we're using; in reality, I'm using a Llama 3 model. This is o1, which is the latest
  • 33:45and greatest model from OpenAI
  • 33:47that really freaks me out.
  • 33:48But, we'll see what it's
  • 33:50going to show. And to
  • 33:51be clear, I haven't run
  • 33:52this through o1 yet,
  • 33:53so I don't know what
  • 33:54it's going to show.
  • 33:55Okay.
  • 33:56So
  • 33:58wow. It's going fast.
  • 33:59It agrees with you that
  • 34:01number one on this differential
  • 34:02is a recurrent or worsening pulmonary embolism. Infection
  • 34:05is the number two. It's
  • 34:06picking up that fever. Could
  • 34:07this be a pneumonia? Number
  • 34:09three, it's now considering also
  • 34:10pericarditis.
  • 34:11Could pericarditis be going on?
  • 34:13Fair enough.
  • 34:15ACS, it's considering.
  • 34:17Again, infection,
  • 34:19and then subtherapeutic
  • 34:20anticoagulation,
  • 34:22pneumothorax.
  • 34:22These are all very unlikely
  • 34:24or COVID nineteen infection. And
  • 34:25it also wants very similar
  • 34:27things to you. It wants
  • 34:27an EKG, a chest X-ray.
  • 34:28It wants a repeat CTA.
  • 34:30This is its management plan. Does any of this
  • 34:33change your thinking?
  • 34:35No.
  • 34:37Is it helpful?
  • 34:39It's okay to say
  • 34:41no. I mean, it just
  • 34:42kind of confirms
  • 34:43what I was
  • 34:45thinking anyway, I guess. Yeah.
  • 34:46So it's it's confirmatory. It
  • 34:47makes you more confident in
  • 34:48what you were thinking.
  • 34:51Okay. On to the next
  • 34:52aliquot. So this is very
  • 34:54unique to our workflow. So
  • 34:55what happens next is the
  • 34:57poor ED resident who presumably
  • 34:58has like twenty other patients
  • 35:00has to come immediately to
  • 35:01see this patient because it's
  • 35:02an ESI of one.
  • 35:03The ED resident, she writes,
  • 35:05patient is a young female
  • 35:06presented to the ED for
  • 35:07chest pain and worsening shortness
  • 35:09of breath. Patient notes that
  • 35:10she has a history of
  • 35:11lupus, was diagnosed with a
  • 35:12PE five days ago after
  • 35:13having a CT for shortness
  • 35:14of breath and chest pain,
  • 35:15started on Eliquis. I think
  • 35:17all of this is the
  • 35:17same. Today, she had worsening,
  • 35:19shortness of breath, chest pain,
  • 35:20worsening palpitations, which prompted her
  • 35:22to present to the ED.
  • 35:23And then she was triaged; all of this, we already know. So the history of lupus, and that this just started today, are the additional things that the ED resident picked up.
  • 35:33Does any of that change
  • 35:34your thinking?
  • 35:35I'm worried
  • 35:36now more that she's got,
  • 35:38like, a constrictive pericarditis, but the PE is
  • 35:43still number one. Yeah.
  • 35:45So you're considering other things,
  • 35:46but it hasn't really changed
  • 35:48yet.
  • 35:50Okay. Let's see. Oops.
  • 35:56You would think I know
  • 35:57how to use a Mac.
  • 35:58It's embarrassing.
  • 36:05So what I love yeah.
  • 36:07Turn this off. What I
  • 36:08love about this is you
  • 36:09can actually see what the
  • 36:10model itself is thinking. So
  • 36:11the model, again, similar to
  • 36:12me, the fact that it
  • 36:13started just one day ago means something. Like, that's acute.
  • 36:17So it's changing its thinking
  • 36:19also. Let's see. Did it
  • 36:20pick up on the lupus
  • 36:21here? Yes.
  • 36:24Ah, now, this is
  • 36:25why I like seeing its
  • 36:26thinking because you can see
  • 36:27it's like, okay. A clot
  • 36:29in lupus, could it be
  • 36:30antiphospholipid antibody syndrome? So still,
  • 36:33like you, it thinks that
  • 36:34a recurrent or worsening PE
  • 36:35is still the number one
  • 36:36thing on the differential. It's
  • 36:37now considering antiphospholipid
  • 36:39antibody syndrome on the differential,
  • 36:41which I think is reasonable
  • 36:42given that additional history of
  • 36:44lupus.
  • 36:45Pericardial effusion. So what you were mentioning, could this be a pericardial effusion? It's actually worried
  • 36:50about tamponade. I don't know
  • 36:51why because the blood pressure
  • 36:52is normal, but it mentions
  • 36:53that. Blood pressure is normal.
  • 36:55So, no hypotension.
  • 36:57Arrhythmia, pericarditis,
  • 36:59still considering ACS and pneumothorax.
  • 37:02It does put panic attack.
  • 37:03And then what it wants,
  • 37:04I don't think it's changing
  • 37:05what it wants. It wants
  • 37:06an EKG, a CTPA, a
  • 37:07TTE, chest x-ray, all the
  • 37:09standard things.
  • 37:10Does this change like, did
  • 37:12this
  • 37:13second opinion change your thinking
  • 37:14at all?
  • 37:16Really? Does it make you
  • 37:17feel more confident or not?
  • 37:20Yes.
  • 37:22Yeah. It makes me feel
  • 37:23yeah. I guess it makes
  • 37:24me feel more confident.
  • 37:26It's okay to say nothing.
  • 37:28[Inaudible audience comment.]
  • 37:31Well, that's, I mean, we'll get into
  • 37:33this. Right? So if you
  • 37:34were to give AI generated
  • 37:36second opinions even if very
  • 37:37effective, it might just lead
  • 37:38to overtreatment of everything.
  • 37:40A lot of that, we'll talk about after. Because this is a big
  • 37:44concern about when and in
  • 37:46what situation you should do
  • 37:47this. Okay. Exam isn't gonna
  • 37:49help you any. I'll put
  • 37:50it in the system, but
  • 37:51this is the ED documented
  • 37:52exam. K. They document it
  • 37:54as completely normal. Okay. This is not one of mine; like, literally, this is
  • 37:58a random patient that
  • 38:01I picked out. So I
  • 38:02have no idea if it
  • 38:03was actually normal. I'm just
  • 38:04gonna put that in there
  • 38:05so it knows. And then
  • 38:05we'll move on to the
  • 38:06next piece of information, which
  • 38:07is the imaging. So
  • 38:09the resident
  • 38:10orders actually pretty much everything
  • 38:12that you ask for. They
  • 38:14also order an EKG. It
  • 38:15is a problem in my
  • 38:16data pipeline that I'm not
  • 38:17able to pull in EKGs
  • 38:18yet, so there is no
  • 38:18EKG here. But, X-ray shows
  • 38:20bibasilar opacities compatible with
  • 38:22small bilateral pleural effusions.
  • 38:26CTA: right lower lobar and segmental PE without right heart strain and a small pericardial effusion, and then these bilateral axillary lymph nodes,
  • 38:38as well as this hypodensity
  • 38:40in the liver. TTE is
  • 38:41performed. Big picture, there's no
  • 38:43strain seen on the TTE.
  • 38:45No tamponade.
  • 38:46But so, yeah, those images,
  • 38:47does that change your thinking
  • 38:48at all or pretty much
  • 38:49where you are?
  • 38:51It, I mean, it knocks down pneumonia a little bit. It knocks down, like, it sounds like her cardiopulmonary silhouette is normal, so, like, I'm not worried that she's got a big pleural effusion or anything like that.
  • 39:07Yeah. I think
  • 39:08I'm still
  • 39:09on the PE
  • 39:11train. You're still on the
  • 39:12PE train. Okay.
  • 39:14Let me
  • 39:16try to not freak out
  • 39:17my AI model too much.
  • 39:19Oh, why am I scrolling
  • 39:21the wrong direction? Embarrassing Adam.
  • 39:23But I want an EKG.
  • 39:26I don't have it.
  • 39:28This is Epic. This is a snowflake problem.
  • 39:31The EKGs are stored in another database, so they're actually not very easy to pull in until, and this is the problem, the cardiologist confirms the read; then you can extract it. So because my data pipeline is running live, none of our patients have EKGs.
  • 39:46So this is what we
  • 39:47were talking about. So much of this comes down to, like, understanding where the data comes from and the limitations.
  • 39:51Okay. So let's see
  • 39:54how the AI model has
  • 39:55changed its thinking. So like you, it still thinks recurrent or worsening PE is the number one diagnosis. It still is worried about antiphospholipid antibody syndrome. But sorry.
  • 40:07Can I just say Yeah?
  • 40:07Please. Antiphospholipid syndrome is not causing her acute, like
  • 40:12I know. So, like, it can say that all at once, and it might be APS, but, like, if the APS is causing something. Right. Right. It doesn't, that doesn't help you.
  • 40:26Well, it helps you down
  • 40:27the line, but not It
  • 40:28doesn't help you in the
  • 40:29acute setting. Exactly.
  • 40:31Pericardial effusion, possibly lupus related
  • 40:33infection,
  • 40:34pericarditis,
  • 40:35arrhythmia.
  • 40:36Yeah. These are all pretty much things
  • 40:37that were on your differential.
  • 40:38Right?
  • 40:40None of this changes your
  • 40:41thinking at all. Doesn't. Except
  • 40:42except to get annoyed because
  • 40:44you're like
  • 40:45even if it's antiphospholipid antibody syndrome, it's still a PE.
  • 40:49Okay.
  • 40:51Labs, I don't know that this is gonna help you much, but at admission, and our lactate cutoff is one point six, this is a slightly elevated lactate. Sodium is one thirty one.
  • 41:00These other labs are all
  • 41:02relatively normal. A troponin was
  • 41:04negative. That's a negative
  • 41:05proBNP.
  • 41:06She is anemic. Hemoglobin is
  • 41:08nine one.
  • 41:09Her INR is elevated one
  • 41:11point seven,
  • 41:12with a PTT of thirty
  • 41:13five.
  • 41:14The diff is normal, and then when they repeated the lactic acid after fluids, it was one point five, which is just below the cutoff. So it's normal in our system, and the repeat troponin was negative. I'm going to, well,
  • 41:25did the labs change anything
  • 41:26for you? No. Yeah. I
  • 41:27I wouldn't think they would.
  • 41:29And let's see if they
  • 41:30change anything via the AI model.
  • 41:42Come on. Show me what
  • 41:43you're doing.
  • 41:47You're running a different ChatGPT.
  • 41:50This is very slow compared
  • 41:51to my experience.
  • 41:53This is a new model called o1. You can see what it's doing. It is using an internalized chain-of-thought process; that's what it's doing, it's thinking through different steps. So that's why it's going so slow. In reality, we're running this on a core with a Llama model that's, like, that fast. It's very fast.
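A minimal sketch of what a sequential second-opinion loop like this can look like, assuming the OpenAI Python client and an o1-class model; the touchpoint summaries, model name, and prompt wording are illustrative stand-ins, not the actual pipeline behind the demo.

```python
# Sketch of a sequential AI second opinion: feed the model each new case
# touchpoint and ask it to revise its differential as information accrues.
# Assumes the OpenAI Python client; the model name, prompts, and case
# summaries are illustrative placeholders, not the demo's real pipeline.
from openai import OpenAI

client = OpenAI()

touchpoints = [
    "Triage: woman with SLE and a known PE on apixaban, one day of pleuritic chest pain.",
    "ED exam documented as completely normal; blood pressure normal, no hypotension.",
    "Imaging: CXR with small bilateral effusions; CTA with right lower lobar and "
    "segmental PE, no right heart strain, and a small pericardial effusion.",
    "Labs: slightly elevated lactate (cutoff 1.6) that normalizes after fluids, "
    "sodium 131, hemoglobin 9.1, INR 1.7, negative troponin and proBNP.",
]

history = []
for info in touchpoints:
    history.append({
        "role": "user",
        "content": "New information: " + info
                   + " Update your ranked differential and state what you would order next.",
    })
    response = client.chat.completions.create(model="o1-preview", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # keep the running case context
    print(answer)
```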
  • 42:11Okay.
  • 42:12So I'm gonna guess it's
  • 42:13gonna say the same things,
  • 42:14but let's check it out.
  • 42:16Recurrent PE,
  • 42:18it's not changing. Right?
  • 42:19So it's basically saying the
  • 42:20same thing.
  • 42:23I don't think any of
  • 42:23these are different.
  • 42:25Okay.
  • 42:28So I'll just go over
  • 42:29the last bit because our
  • 42:31final touch point is when
  • 42:31medicine sees the patient, so
  • 42:33when medicine actually admits the
  • 42:34patient. So the, I bolded
  • 42:36the relevant thing. So, the
  • 42:38ED no. Sorry. The medicine
  • 42:40intern sees the patient still
  • 42:41in the ED, and the
  • 42:42patient says,
  • 42:44Oh, this happened to me
  • 42:45before six years ago. I
  • 42:47had a lupus flare with
  • 42:48very similar symptoms,
  • 42:49and the medicine resident finds out that the patient's outpatient rheumatologist
  • 42:54for the last couple weeks
  • 42:55has felt that she's having
  • 42:56a lupus flare and has
  • 42:57been modifying her medications
  • 42:59and that these are the
  • 43:00current medications. So methotrexate,
  • 43:02twenty q seven days, which
  • 43:03has just been increased, plaquenil,
  • 43:05folic acid, and then apixaban,
  • 43:07which is new.
  • 43:08Does any of that
  • 43:11change
  • 43:13your
  • 43:16thinking?
  • 43:18It's okay to say no.
  • 43:20No. It it it doesn't.
  • 43:21It's just you know, I
  • 43:22think there's
  • 43:24two things going
  • 43:27on.
  • 43:30Do you want me to
  • 43:30tell you what the final
  • 43:32Like, I guess my question
  • 43:33is, is it pleuritic?
  • 43:34Is it pleuritic? Yeah. You
  • 43:36wanna know what the final
  • 43:36diagnosis was from the medicine team? Is it a lupus flare? It's a lupus flare. Yeah. So the final diagnosis is actually that this poor woman has pericarditis and pleuritis from
  • 43:46a lupus flare and had
  • 43:47a PE secondary to that.
  • 43:49So it is two different
  • 43:50things going on. And, you
  • 43:52know, the ED team did
  • 43:54everything completely appropriately. Right? When
  • 43:56you have a patient like
  • 43:56this come in, obviously, you
  • 43:57wanna make sure they're not
  • 43:58having something devastating. So it's
  • 43:59not even that this patient
  • 44:00was mismanaged
  • 44:02in any way.
  • 44:03It's just not what the
  • 44:05initial diagnosis was. So I
  • 44:06was more curious.
  • 44:07Come on.
  • 44:09It's very slow. It's being
  • 44:10very finicky. So I'm curious
  • 44:12what the AI model is
  • 44:13going to end up saying.
  • 44:15You can see how slow
  • 44:16it is. This is all
  • 44:17the different steps.
  • 44:20Well, a friction rub?
  • 44:23I did not see this
  • 44:25patient. This is all mediated through the chart, and I will tell you that in our data pipeline, which pulls in only two different documented physical exams, no one documented it; that doesn't mean she didn't have it.
  • 44:40I highly doubt that the ED resident did the exam
  • 44:42later. Okay. So let's see
  • 44:43what the model says.
  • 44:45Okay. So this is, and, this is the right final diagnosis, which is lupus flare with serositis.
  • 44:51And then number two, the patient has a pulmonary embolism. And, of course,
  • 44:55what we found out, and I'm not this patient's doctor, but what ended up happening is they get the outside CTA, and it
  • 45:01shows that, in fact, the
  • 45:03PE is no larger. It's
  • 45:04even a little bit smaller.
  • 45:05So this is not a
  • 45:05recurrent PE. The symptoms were
  • 45:07likely driven from pericarditis and
  • 45:09pleuritis, so lupus serositis.
  • 45:12The patient was worked up for antiphospholipid antibodies, and all the tests were negative. So that's the final
  • 45:17diagnosis. So reflecting back, like,
  • 45:19would this have been helpful
  • 45:20if you were getting this
  • 45:21and would have driven you
  • 45:22in the wrong direction?
  • 45:25I don't know if it
  • 45:26would have driven me in
  • 45:27the wrong direction, but it was confirmatory.
  • 45:31But I mean,
  • 45:33I think
  • 45:35well, two things. One, we're
  • 45:36in an ED setting. So,
  • 45:38you know, I think the
  • 45:39most important thing is ruling
  • 45:41out the things that are
  • 45:41gonna kill her in the
  • 45:42next hour
  • 45:44so, you know
  • 45:46It might be lupus, but you don't want to miss, right, you don't want to miss a PE or a dissection
  • 45:51I guess I don't know
  • 45:51if that's anchoring, but, like, my top things are: I wanna make sure
  • 45:55she's not tamponading. I wanna
  • 45:57make sure she's not having
  • 45:58a massive PE. I wanna
  • 45:59make sure she's not having,
  • 46:00like, a dissection
  • 46:02or she's having, like, a
  • 46:03big heart attack. So, like,
  • 46:05other than that,
  • 46:06it didn't really change anything.
  • 46:08Because you would have done
  • 46:09everything exactly the same. And
  • 46:10you were considering those cannot
  • 46:12miss diagnoses from the very
  • 46:13beginning, obviously.
  • 46:14Yeah. Yeah. I don't think
  • 46:15it would have changed much.
  • 46:17Would it have driven you
  • 46:18in the wrong direction? Right?
  • 46:18Would getting a second opinion
  • 46:20like this have made you
  • 46:21second guess yourself or I
  • 46:23think it would have made
  • 46:24me order more tests.
  • 46:26And in in particular, like,
  • 46:28in the ED, you would
  • 46:28have ordered a bunch of
  • 46:29those tests?
  • 46:31Maybe not in the ED,
  • 46:32but, like, it was I
  • 46:33saw it. It was, like,
  • 46:34get a cardiac MRI.
  • 46:36I hope no one well,
  • 46:37it wouldn't matter. Cardiology wouldn't
  • 46:38do the cardiac MRI, but,
  • 46:39yes, that is not an
  • 46:41appropriate test for this workup.
  • 46:42Yeah. But,
  • 46:43yeah, maybe order more I
  • 46:45I might have ordered more
  • 46:46labs if I'm being honest.
  • 46:47Yeah.
  • 46:48Alright. Well, that's thank you
  • 46:50very much. I'll give you
  • 46:51a hand. I was
  • 46:52very sorry that I made you practice general medicine again. You thought you escaped.
  • 46:59No, that's, oops. Well, that's cool that you have one of these things here. So, I mean,
  • 47:04this is an example
  • 47:06of what it looks like
  • 47:08in practice in a randomly
  • 47:09selected case. And you can
  • 47:10start to already see when
  • 47:11you go through it some
  • 47:12of the challenges of implementing
  • 47:14a system like this. I'm
  • 47:15gonna go over some of
  • 47:16the data, including some of
  • 47:17the new data before, seeing
  • 47:18if anybody has any questions
  • 47:19and before I lose my
  • 47:20voice.
  • 47:22So LLMs encode lots of
  • 47:23knowledge.
  • 47:24I'm sure that everyone saw
  • 47:25that, you know, it can
  • 47:26pass the USMLE.
  • 47:27I don't care about this.
  • 47:29You guys should not care
  • 47:30about this either. It turns
  • 47:31out, and this is actually from some of our interesting ablation studies, that LLMs' performance on exams has less to do with medical knowledge than with the fact that they have learned the semantic structure of multiple-choice questions, meaning that they are effectively good test takers.
  • 47:46Some of my colleagues did
  • 47:47a really cool experiment where
  • 47:48they made up two new
  • 47:49organ systems, and then they
  • 47:50had test writers write up
  • 47:51multiple choice questions with those
  • 47:52fake organ systems. And the
  • 47:54LLMs still did really well
  • 47:55on it because they learned
  • 47:56to understand what a question
  • 47:58looks like and guess the
  • 47:59right answer from that. And
  • 48:00I think everyone here knows
  • 48:01that if you're being honest
  • 48:02with yourself about what multiple
  • 48:03choice is like, you start by excluding a couple of things. We
  • 48:06all know how that works.
  • 48:07So none of that matters.
  • 48:09This empathy thing, I think,
  • 48:10is overplayed also. You should
  • 48:12know that
  • 48:13this is the justification for
  • 48:15having LLMs write portal messages,
  • 48:16that patients find their communication
  • 48:18more empathetic.
  • 48:19The standard, of course, is compared to a very overstretched PCP who's just trying to communicate your CBC results. So these are not empathic in-person communications, but at least in written communications, people do find the LLM to be more empathetic.
  • 48:34For what I care about,
  • 48:36LLMs are able to make diagnoses. On a lot of the benchmarks that, like, the field has accepted, LLMs have long since surpassed humans,
  • 48:46but a lot of these
  • 48:47are relatively artificial because they're
  • 48:48very information dense settings, very
  • 48:50complicated diagnoses,
  • 48:52and when you look at where the diagnostic errors come from, they're not coming from lupus nephritis. They're coming from people misdiagnosing common things.
  • 49:01They have, and this is fascinating, an emergent probabilistic reasoning; there's no reason that semantics, that language, should give you a probabilistic understanding of disease states. But in fact, and this was studied with Dan Morgan, when you compare them to large groups of humans, they have a better sense of the pretest probability of disease and how that changes with subsequent tests. That holds up pretty well. For the post-test probability of disease, after a positive test it's not really any better than humans, but after a negative test, it's a lot better than us.
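As a concrete illustration of the pre-test to post-test updating being compared here, a worked example using the odds form of Bayes' rule; this is a sketch with made-up numbers, not the study's code.

```python
# Worked example of pre-test -> post-test probability via likelihood ratios,
# the probabilistic updating compared between LLMs and clinicians.
# All numbers are illustrative, not taken from the study.

def post_test_probability(pretest_p: float, likelihood_ratio: float) -> float:
    """Apply Bayes' rule in odds form: post-test odds = pre-test odds * LR."""
    pretest_odds = pretest_p / (1.0 - pretest_p)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

pretest = 0.30        # illustrative pre-test probability of the diagnosis
lr_positive = 18.0    # illustrative likelihood ratio for a positive test
lr_negative = 0.05    # illustrative likelihood ratio for a negative test

print(f"After a positive test: {post_test_probability(pretest, lr_positive):.2f}")
print(f"After a negative test: {post_test_probability(pretest, lr_negative):.2f}")
```

The negative-test case is exactly where the comparison above favors the models most.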
  • 49:33They can forecast similarly well. So if you ask it, what do you think the percentage chance of the final diagnosis is? This was done with neurologists, ID doctors, and pediatricians. It outperforms every single individual human and every single group of humans, only being beaten when you take the best groups and put them together.
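One hedged sketch of how forecasts like these can be scored against the final diagnosis, using a Brier score (mean squared error between stated probabilities and 0/1 outcomes; lower is better). The forecasts and outcomes below are invented for illustration, not data from the study.

```python
# Scoring probabilistic diagnostic forecasts with a Brier score (lower is better).
# Forecasts and outcomes are invented for illustration.

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

llm_forecasts = [0.80, 0.10, 0.60, 0.25]  # model's stated chance of the final diagnosis
md_forecasts = [0.60, 0.30, 0.50, 0.40]   # one clinician's estimates on the same cases
outcomes = [1, 0, 1, 0]                   # whether that diagnosis was confirmed

print("LLM Brier score:", brier_score(llm_forecasts, outcomes))
print("MD Brier score:", brier_score(md_forecasts, outcomes))
```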
  • 49:52They are able to display
  • 49:54reasoning.
  • 49:55So when it comes to,
  • 49:56like, how will you communicate
  • 49:58with a human,
  • 49:59there's this whole question of
  • 50:00human computer interaction. What would
  • 50:02an AI
  • 50:06second opinion look like? When you give LLMs actual cases, as new information comes in, and ask them to update their thinking, and you compare that to humans, they outperform humans consistently.
  • 50:16It outperforms attendings, who outperform residents, and there's no difference in efficiency, accuracy, quality, or cannot-miss diagnoses.
  • 50:22It does hallucinate more. So this is actually a pretty high hallucination rate. Right? It makes up stuff twelve percent of the time, compared to only three percent of the time with humans. Now the hallucinations
  • 50:31are relatively minor. Some of
  • 50:32them are kind of funny.
  • 50:33One of them was a
  • 50:34patient who had diverticulitis, and the LLM wanted the human to keep gastroenteritis in mind because the patient had recently traveled to Texas, and going to Texas was a risk factor for enterotoxigenic E. coli, which I'm pretty
  • 50:47sure is not true. So
  • 50:48that is a hallucination,
  • 50:49but probably not one that
  • 50:51would harm the patient.
  • 50:52This study was done by
  • 50:53my colleagues at Google,
  • 50:55very controversial when it came
  • 50:55very controversial when it came out. What they did, and it's actually not a very high-performing model, it's a PaLM 2 model, but the model itself could solve CPCs. You can see humans
  • 51:04could solve clinical pathological conferences,
  • 51:06and they randomized humans to
  • 51:09either solve conferences themselves
  • 51:09either solve conferences themselves using Google search, solve them using the AI model, or let the AI model work by itself. And this was
  • 51:16because when you gave humans
  • 51:17the AI model, it actually
  • 51:18made the model not perform
  • 51:19as well. So adding humans
  • 51:21into the mix lowered performance.
  • 51:23This is against
  • 51:24the kind of standard precepts
  • 51:26of the informatics field, so
  • 51:27quite controversial when it came
  • 51:28out. Unfortunately, my group ran a large randomized controlled trial looking at very nuanced measures of reasoning in real cases, so not CPCs, and we found the same thing. The human performance is on the right by itself in blue.
  • 51:43Humans using the AI are
  • 51:44in green, and the AI
  • 51:45model by itself is in
  • 51:46red. So we found the
  • 51:47same thing, and because we
  • 51:49did a very nuanced measure,
  • 51:50I can tell you why.
  • 51:51And it's that humans, when the AI model tells them that they're wrong, disregard those pieces of information. In particular, humans don't like an AI model critiquing or disconfirming the things that they think.
  • 52:06Another randomized controlled trial that was just accepted, this is the one I just heard back on from Nature. So this is in management decisions. Management decisions are notoriously tricky to measure.
  • 52:15In this one, LLMs did improve people's performance when they used them to make management decisions, when randomized.
  • 52:21But when we look at
  • 52:21the subgroup, it's not what
  • 52:23you would think. Like, people
  • 52:24aren't using the LLM to
  • 52:25say, what is the right
  • 52:26dose of apixaban, or even
  • 52:27should I give apixaban? What
  • 52:28the LLM did was cue
  • 52:30them to, for example, apologize
  • 52:32after making a medical error
  • 52:33or communicate better with other
  • 52:35providers or take patient factors
  • 52:37into account when following a
  • 52:38likely cancerous nodule. So it
  • 52:40actually improved performance
  • 52:41not in, like, what we
  • 52:42think of as the standard
  • 52:43management
  • 52:45domains, but in things that
  • 52:46we think humans are good at.
  • 52:49A lot of the work
  • 52:49that I'm doing with Google
  • 52:50is on building models that
  • 52:51can collect data. This is
  • 52:53from the AMIE system. This is standardized patients, not real patients, but a true Turing test where standardized patients talk to a terminal, and they don't know whether it's a human or an AI on the other side. And on twenty-six of twenty-six patient domains, the patients preferred the AI, and on twenty-eight of thirty-two axes from the physician graders, the AI was preferred, and this held up in every single diagnostic category. So we're running this in clinical trials, in actual patients, now. It's still performing quite well, and they're increasingly able to collect data.
  • 53:27Now
  • 53:28the unpublished data that I'm about to show you is from my grad student; I put it in this presentation this morning because the models continue to improve. If you had asked me six months ago, I would have said we're seeing convergence of model performance, that there were only going to be incremental improvements. But this is for solving CPCs. So this is
  • 53:48one of these benchmarks that
  • 53:49goes back almost sixty years.
  • 53:51And the new models have
  • 53:52surpassed everything that came before,
  • 53:54and you can see humans
  • 53:55are in brown at the
  • 53:55bottom. This is the one
  • 53:56that freaks me out because
  • 53:58this is not an HCI
  • 53:59study, right? This is just
  • 54:00looking at the human baseline,
  • 54:01but I showed you these
  • 54:02are real cases for the
  • 54:03diagnostic and management decisions.
  • 54:05The colors are different, but
  • 54:06these are the old graphs,
  • 54:07and this is the new
  • 54:08model on the left. And
  • 54:09you can see that the
  • 54:10new models are performing far better than any other system, not only much
  • 54:16better than the humans, but
  • 54:17better than the previous AI
  • 54:18systems. So we're continuing to
  • 54:20see performance gains.
  • 54:22Eric Horvitz and the Microsoft group published their Medprompt follow-up on o1 today, and they came to the same conclusion as my paper, which is, like, these things have gotten so good that we need new benchmarks, or clinical trials, because they are outperforming everything that we throw at them.
  • 54:39I can go over these
  • 54:40quickly. In reality, so a
  • 54:42lot of tools are now
  • 54:43being used in clinical practice.
  • 54:45They're actually kind of underperforming
  • 54:46relative to what we were sold.
  • 54:47So if you look at
  • 54:48some of the early performance
  • 54:49of AI,
  • 54:51scribes, which I know you
  • 54:52guys are using here at
  • 54:53Yale and some of the
  • 54:54clinics, some of the early
  • 54:55studies actually suggest there is
  • 54:56no efficiency gain because they
  • 54:58hallucinate, and the doctors have
  • 54:59to go back and check
  • 55:01the models.
  • 55:02And, yeah, people like it,
  • 55:03but it's not really saving
  • 55:05anybody time and, you know,
  • 55:07people always care about money.
  • 55:08It's not saving anybody any
  • 55:09money either.
  • 55:10The same thing is happening
  • 55:12with the
  • 55:13patient portal messaging. One of
  • 55:15the very depressing things from
  • 55:16the JAMA study on this
  • 55:17is that it actually took
  • 55:18more time, seven percent more
  • 55:18more time, seven percent more physician time, when the
  • 55:23because it hallucinates or says
  • 55:25something harmful, and the doctor
  • 55:26has to go back and
  • 55:27edit it. Again, the patients
  • 55:28liked the responses more, but
  • 55:29it took the doctors more
  • 55:30time.
  • 55:31And then what everybody should
  • 55:33know, I don't think I
  • 55:34need to say this, but
  • 55:34LLMs are racist and sexist.
  • 55:36Because they are trained on our language and then fine-tuned by humans, they encode all of the biases that humans have. Now, and I just published a study in JAMA, they do appear to be less racist and sexist than us, but they are still racist and sexist.
  • 55:52So in a world where
  • 55:54we're trying to get past,
  • 55:55like, race based medicine,
  • 55:57especially as LLMs get more
  • 55:58and more powerful, we should
  • 55:59know that they are showing
  • 56:01human biases, not only cognitive biases,
  • 56:03but racial and gender biases,
  • 56:04which is concerning.
  • 56:06And then we talked a
  • 56:08little bit, but the you
  • 56:08know, HCI is actually quite
  • 56:10challenging because
  • 56:12if used inappropriately, this technology
  • 56:14probably will drive overtreatment.
  • 56:16Different people need different opinions
  • 56:19at different times. Like, a
  • 56:20second opinion is not universally
  • 56:21helpful.
  • 56:22Also, HCI is unpredictable. There's
  • 56:24a great study from some
  • 56:24of my colleagues at MIT
  • 56:26that showed that the best
  • 56:27radiologists
  • 56:28actually have their performance lowered
  • 56:30by a high performing AI
  • 56:31because they second guess themselves.
  • 56:33So just because an AI
  • 56:34model works well, even if
  • 56:36it consistently works well in silico, doesn't mean that it's actually going to improve human performance, because, you know, again, we are hairless apes that evolved to be hunter-gatherers, and now
  • 56:48we're trying to do complex
  • 56:49medicine in the twenty first
  • 56:50century. So,
  • 56:52whew, I'm gonna lose my
  • 56:53voice. That is it for
  • 56:54this presentation. So if anybody
  • 56:56has any questions or wants
  • 56:57to talk about pathology, I
  • 56:58am happy to, entertain them.
  • 57:05And thank you very much.
  • 57:08Are the new models based on the performance of the previous models, formed from beta testing? So the new model, o1, this is really interesting, has no new data in it. It is the same data as GPT-4 Turbo. So
  • 57:21the cutoff is like last
  • 57:23year. So what is improving its performance has nothing to do with the training data; what they're doing is chain of thought. So if you get a model to speak its thinking out loud, it does better. And what they've done is reinforcement learning on the chain of thought.
  • 57:34So they're teaching it how
  • 57:35to think out loud and
  • 57:36then reinforcing that over time.
  • 57:38So these models, these are
  • 57:39all computational techniques. It has
  • 57:40nothing to do with the
  • 57:41underlying data, and there's no
  • 57:43more scale. The parameters of
  • 57:44the model are exactly the
  • 57:45same, which is one of
  • 57:46the reasons I'm so freaked
  • 57:47out because I didn't think
  • 57:47we could get such impressive
  • 57:49performance gains without increasing the
  • 57:51number of parameters.
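A minimal sketch of the prompting-level analogue of that idea, assuming the OpenAI Python client: the same question asked directly versus with an instruction to reason step by step. The reinforcement learning over chains of thought happens at training time and is not reproducible here; the model name and prompts are illustrative.

```python
# The chain-of-thought idea at the prompting level: the same question asked
# directly vs. with an instruction to think step by step. o1-class models
# internalize this via reinforcement learning over their chains of thought;
# this sketch shows only the prompting analogue. Model name is illustrative.
from openai import OpenAI

client = OpenAI()
question = "A patient on warfarin has an INR of 7 with no bleeding. What should be done?"

variants = {
    "direct": question,
    "chain of thought": "Think through this step by step, listing each "
                        "consideration before committing to an answer:\n" + question,
}

for label, prompt in variants.items():
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---\n{reply.choices[0].message.content}\n")
```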
  • 57:54Yes.
  • 58:12Yeah. Yeah. So this is
  • 58:13a great question. Like,
  • 58:15what would it look like
  • 58:16in practice?
  • 58:17So
  • 58:18I'm assuming
  • 58:19we're talking Epic here. Right?
  • 58:21So the the reality of
  • 58:23the situation is that Epic
  • 58:24is working on clinical decision
  • 58:26support software.
  • 58:28This is not a huge
  • 58:29priority. If you look at
  • 58:30what Epic is working on,
  • 58:31they're mostly working on efficiency, like text summarization.
  • 58:34However,
  • 58:35Epic does make it fairly
  • 58:37easy to have a data
  • 58:38pipeline to put information in.
  • 58:40So even at my own institution, we have a pipeline through Amazon Web Services where I can push a second opinion into the chart, trivially easily. Like, any health system could
  • 58:46do this. Any third party,
  • 58:47there are vendors right now
  • 58:48who want to sell you
  • 58:49this technology. No one should
  • 58:50buy it, by the way,
  • 58:52because this is not tested,
  • 58:56and I'm pretty certain that
  • 58:57it will lead to worse
  • 58:58care if used, like, routinely
  • 59:01on every single patient. So
  • 59:02from a technological standpoint, you'd
  • 59:04need, like,
  • 59:06fifteen hours of a programmer's
  • 59:07time to build a pipeline
  • 59:08to do this. The question
  • 59:10becomes, like, what are the
  • 59:11other strategies that you're going
  • 59:12to do to make sure
  • 59:13that you're giving a second
  • 59:14opinion to the right person
  • 59:15at the right time? At
  • 59:16the BI, what we're doing through the Home Run Network is looking at
  • 59:20serving second opinions at clinical
  • 59:22decompensation. So at the moment
  • 59:23that a patient's about to
  • 59:24go to the ICU, based
  • 59:26on this logic that the
  • 59:27patient is already really sick,
  • 59:30we do diagnostic timeouts anyway.
  • 59:32So this is just another
  • 59:33part of the diagnostic time
  • 59:34out. But, so a lot
  • 59:35of our work is like
  • 59:36looking at audit logs, trying
  • 59:37to get a sense on
  • 59:38which patients or which providers
  • 59:39need second opinions, and that's
  • 59:40much more computationally intense.
  • 59:43My guess is, like, in
  • 59:44five years, Epic will just
  • 59:45build this into Epic.
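For a sense of what that programmer-time estimate buys, here is a hypothetical sketch of writing a second opinion back into the chart as a FHIR DocumentReference; the endpoint, token, and patient reference are placeholders, and any real deployment would go through the institution's integration and governance processes rather than code like this.

```python
# Hypothetical sketch: posting an AI second opinion into the chart as a FHIR
# DocumentReference. The endpoint, token, and patient reference are all
# placeholders; this illustrates the plumbing, not a tested integration.
import base64
import requests

FHIR_BASE = "https://ehr.example.org/fhir"  # placeholder FHIR endpoint
TOKEN = "replace-with-oauth-token"          # placeholder credential

note_text = "AI second opinion: recurrent PE remains most likely; consider lupus serositis."

document = {
    "resourceType": "DocumentReference",
    "status": "current",
    "type": {"text": "AI-generated diagnostic second opinion"},
    "subject": {"reference": "Patient/example-id"},  # placeholder patient
    "content": [{
        "attachment": {
            "contentType": "text/plain",
            "data": base64.b64encode(note_text.encode("utf-8")).decode("ascii"),
        }
    }],
}

response = requests.post(
    f"{FHIR_BASE}/DocumentReference",
    json=document,
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/fhir+json",
    },
)
response.raise_for_status()
print("Created resource id:", response.json().get("id"))
```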
  • 59:48Yes.
  • 59:50So, I mean, the question is, for, let's say, the current models: are they capable of creating new information? And if not, what does that mean?
  • 01:00:03You are asking the right
  • 01:00:04questions. So okay. What I'm
  • 01:00:05gonna say is controversial.
  • 01:00:08LLMs,
  • 01:00:10large language models, codify human
  • 01:00:11knowledge.
  • 01:00:13There are actually computational tests of their ability to be creative, for things outside of their training set. I do not think there is any reason to think that any large language model will ever be able to be creative outside of its training set. They are effectively
  • 01:00:29locking in human knowledge. Now
  • 01:00:30they can be updated, and
  • 01:00:31they can read things and
  • 01:00:33integrate that new knowledge, but
  • 01:00:34they're still fundamentally limited by
  • 01:00:36what's in their training set.
  • 01:00:37And that gets to, like,
  • 01:00:38well, what are the impacts
  • 01:00:39for medicine? The fact of
  • 01:00:40the matter is when it
  • 01:00:40comes to diagnosis,
  • 01:00:42ninety-eight, ninety-nine times out of a hundred,
  • 01:00:44we're not being creative, but
  • 01:00:45sometimes that's necessary. And what
  • 01:00:47does this do for human
  • 01:00:48creativity? I mean, everyone's seen
  • 01:00:49this. When you work with
  • 01:00:50an LLM, you have it
  • 01:00:51write something. It's very average.
  • 01:00:53It's very milquetoast.
  • 01:00:54It literally is picking out
  • 01:00:56the average of its training
  • 01:00:57set. That's actually one of
  • 01:00:58the reasons it works well
  • 01:00:59in diagnosis, but there's gonna
  • 01:01:00be downstream effects and the
  • 01:01:02lack of creativity is one
  • 01:01:03of them.
  • 01:01:08Well, you mean in science or in, any domain that we touch. I mean, this is a very real concern.
  • 01:01:15Yeah. You're not wrong.
  • 01:01:18I think, is that depressing? I'm sorry. You're maybe looking for a more optimistic answer there.
  • 01:01:24LLMs
  • 01:01:25will, I don't think, ever
  • 01:01:26be capable of creativity in
  • 01:01:28the way that a human
  • 01:01:29is.
  • 01:01:32Oh, I have all the
  • 01:01:32time in the world.
  • 01:01:34That's not true. But
  • 01:01:36What about, specifically, the physical exam? You know, how can we be doing a diagnostic differential based on the exam? And, and this is a bigger question, about the data that we need, or errors. And there's also implications for what we should be teaching our students.
  • 01:01:59A physical exam, in terms
  • 01:02:01of collecting data, is not
  • 01:02:02something that an LLM can
  • 01:02:03do now.
  • 01:02:05Multimodal models, and this is why I'm telling you, pathology, everyone who confidently predicts that pathology is going to be computerized: that technology is a ways away. Multimodal models do not perform very well.
  • 01:02:16And we're maybe five to ten years from multimodal models being able to perform at a human level. So in the interim, an accurate physical exam
  • 01:02:24becomes incredibly important. And you
  • 01:02:26saw that exam that the resident put in there. That's a templated
  • 01:02:29exam.
  • 01:02:31I mean, probably the resident
  • 01:02:32did an appropriate exam for
  • 01:02:34someone who they thought might
  • 01:02:35have a dissection or PE,
  • 01:02:36but she documented just the
  • 01:02:38templated exam and that can
  • 01:02:40throw off a language model.
  • 01:02:41So when it gets to, like, what are the things that humans are uniquely good at in a world where we're working more and more with AI models: good observational skills, and learning how to do those skills and then accurately represent them, is really important.
  • 01:02:55I think we'll call it
  • 01:02:56a day. We'll give Adam a chance to rest his voice. Phew. Yeah.