Adapting a Machine Learning Algorithm to Track Symptoms of COVID-19
May 22, 2020
Julie Womack, PhD, CNM, FNP (BC), Associate Professor of Nursing
Information
- ID
- 5232
Transcript
- 00:00I'd like to introduce our next speaker,
- 00:02Doctor Julie Womack.
- 00:03Doctor Womack is an associate
- 00:05professor at the Yale School of
- 00:07Nursing and a Health Sciences
- 00:09researcher at the West Haven, VA.
- 00:11She received her PhD in nursing from Yale
- 00:13University and completed a postdoctoral
- 00:15fellowship in informatics at the VA.
- 00:18Doctor Womack, thank you for being here.
- 00:22Thank you, and can everyone hear me?
- 00:25I hope, um.
- 00:27So I'll be talking with you about
- 00:29the work that colleagues of mine
- 00:31and I are doing to adapt an NLP
- 00:34pipeline machine learning algorithm
- 00:36to identify symptoms of COVID-19.
- 00:40Next. Symptoms are of crucial
- 00:42importance to patients.
- 00:44They are how an individual
- 00:46experiences their illness.
- 00:47For providers, symptoms are markers
- 00:50that can help to identify disease or
- 00:53to develop a list of differentials.
- 00:56Recognition of symptoms has been
- 00:58an important component of COVID-19.
- 01:00The symptoms considered markers
- 01:02of the disease have changed over time.
- 01:05Initially it was fever,
- 01:06cough and shortness of breath.
- 01:08These are still considered to
- 01:10be the primary symptoms,
- 01:12but this list has expanded to include
- 01:15others such as nasal congestion,
- 01:17sore throat, anosmia,
- 01:19ageusia, headache,
- 01:21dizziness, fatigue, muscle aches,
- 01:23chills, and GI symptoms,
- 01:24including nausea, loss of appetite,
- 01:26vomiting, and diarrhea. Next.
- 01:30In both SARS and MERS, symptom-based
- 01:33case detection and subsequent testing
- 01:35to guide isolation and quarantine
- 01:36were key, and there was minimal
- 01:39evidence that asymptomatic cases were
- 01:41important routes of transmission.
- 01:43With COVID-19 there is potentially
- 01:45a sizable percentage of cases
- 01:47that are asymptomatic,
- 01:48and these have been shown to be
- 01:50important players in viral transmission.
- 01:53So symptoms alone are insufficient to
- 01:55identify cases of COVID-19 or even
- 01:58to identify those who should be tested. Next.
- 02:01Furthermore,
- 02:01most of the symptoms experienced with
- 02:04COVID-19 are not unique to COVID-19,
- 02:07but rather are shared by
- 02:10other respiratory viruses
- 02:11and health conditions.
- 02:12Here's a graph of COVID-19 testing within
- 02:15the VA Connecticut health care system.
- 02:18The number of tests is
- 02:21noted on the vertical axis.
- 02:23Dates are on the horizontal,
- 02:25the Green Line represents
- 02:27negative COVID-19 tests.
- 02:28The red line is positive tests
- 02:30and the blue line represents
- 02:32those with pending results.
- 02:34So out of the 1,600 tests done through May 7th,
- 02:38214, or 12%, were positive
- 02:40for COVID-19.
- 02:41These results suggest that the majority
- 02:44of those with symptoms may not,
- 02:46in fact, have COVID-19. Next.
- 02:52But despite the limitations to using
- 02:54symptoms to diagnose COVID-19 or to
- 02:57identify those who need to be tested
- 02:59and quarantined, symptoms are still an
- 03:01important component of the pandemic.
- 03:03I work with a number of investigators
- 03:06who are interested in using VA electronic
- 03:08health record, or EHR, data to study
- 03:11different aspects of symptoms in COVID-19.
- 03:13The first step for all of these projects
- 03:15is to develop a reliable approach for
- 03:18identifying these symptoms in the EHR.
- 03:21A number of possible approaches exist.
- 03:23These include looking at problem lists,
- 03:25ICD codes, and inferring symptoms
- 03:28from prescription data. However,
- 03:30all of these approaches underestimate
- 03:32the number and type of symptoms
- 03:34discussed at a visit.
- 03:36Most documentation of symptoms
- 03:38takes place in clinical notes.
- 03:40These documented symptoms can be extracted
- 03:42from text notes using natural language
- 03:44processing and machine learning algorithms,
- 03:47and then converted into structured data
- 03:49for the purposes of analysis.
- 03:53So today I'm gonna talk a bit about the
- 03:55symptom extractor pipeline that we will
- 03:57adapt to identify COVID-19 symptoms in VA
- 03:59clinical notes.
- 04:00I'm going to talk a bit about what
- 04:02that adaptation process will look like,
- 04:04and then I'm going to briefly describe
- 04:07projects that will build on this work. Next.
- 04:10The symptom extractor pipeline that we
- 04:12will use was originally developed by
- 04:14Guy Divita and colleagues from the VA
- 04:17Salt Lake City Health Care System.
- 04:19Next. It is a UIMA natural language
- 04:23processing pipeline that was assembled
- 04:26using V3NLP Framework components.
- 04:29Both UIMA and V3NLP
- 04:31are open source software.
- 04:33UIMA, short for Unstructured
- 04:36Information Management Architecture, is
- 04:37an OASIS standard for content analytics,
- 04:40originally developed at IBM.
- 04:41The V3NLP Framework is a set of
- 04:45functionalities and components that
- 04:47provide Java developers the ability
- 04:50to create novel annotators,
- 04:52place annotators into pipelines,
- 04:54and include applications to extract
- 04:56concepts from clinical text.
- 04:58There are scale-up and scale-out
- 05:01functionalities developed with the
- 05:03express purpose of processing
- 05:05large numbers of records.
- 05:07A machine learning annotator was added
- 05:09at the tail end of the NLP pipeline
- 05:12to enhance the pipeline's ability
- 05:14to identify true symptoms.
- 05:16This figure depicts the components
- 05:18of the symptom extractor pipeline.
- 05:20As is typical of UIMA pipelines,
- 05:22this one is composed of a series of
- 05:25annotators where the output of one
- 05:27becomes the input of the next. Next.
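The chaining idea described here, where each annotator reads what earlier annotators produced and adds its own annotations, can be sketched in a few lines of Python. This is a minimal illustration only: the real pipeline is built from UIMA and V3NLP components in Java, and the names `Document`, `sentence_annotator`, `token_annotator`, and `run_pipeline` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)  # (type, start, end) tuples


def sentence_annotator(doc):
    # Naive splitter (an assumption): a sentence is a run of text ending in . ? or !
    for match in re.finditer(r"[^.?!]+[.?!]", doc.text):
        doc.annotations.append(("sentence", match.start(), match.end()))
    return doc


def token_annotator(doc):
    # Consumes the sentence annotations produced by the upstream annotator.
    for kind, start, end in list(doc.annotations):
        if kind != "sentence":
            continue
        for match in re.finditer(r"\w+", doc.text[start:end]):
            doc.annotations.append(("token", start + match.start(), start + match.end()))
    return doc


def run_pipeline(doc, annotators):
    # The output of one annotator becomes the input of the next.
    for annotate in annotators:
        doc = annotate(doc)
    return doc


doc = run_pipeline(Document("Patient reports cough. Denies fever."),
                   [sentence_annotator, token_annotator])
sentences = [a for a in doc.annotations if a[0] == "sentence"]
tokens = [doc.text[s:e] for kind, s, e in doc.annotations if kind == "token"]
```

Swapping the order of the two annotators would yield no tokens, which is the practical meaning of "the output of one becomes the input of the next."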
- 05:32Annotators at the front end of the pipeline
- 05:35decompose text into document elements.
- 05:38The sectionizer breaks the notes into
- 05:40sections, so chief complaint, history,
- 05:43past medical history, medications, etc.
- 05:45A tokenizer then breaks up the notes
- 05:47further into component parts, including,
- 05:50for example, sentences or phrases. Next.
- 05:54The next part of the pipeline identifies
- 05:57templated components of the notes
- 05:59that require an assertion logic
- 06:01different from that used in plain-text
- 06:03notes. Next.
- 06:07So we're all familiar with
- 06:09the straightforward SOAP note
- 06:10documentation, as shown in this sample.
- 06:12So the subjective and objective
- 06:14information from the patient is noted,
- 06:16and then assessments and plans are made.
- 06:19The symptom statements here are
- 06:21fairly straightforward: positive
- 06:22for shortness of breath and negative
- 06:24for pain, chest pain, and palpitations.
- 06:28Next. Checkboxes are one form that
- 06:31templated text can take.
- 06:33Obviously, this is not natural language,
- 06:35so the logic used to identify symptoms
- 06:38here must be very different from
- 06:41that used for a simple SOAP note.
- 06:44Here, the condition of interest
- 06:46is only true if there is a check
- 06:48next to the concept of interest.
- 06:50So for example,
- 06:51in the first section homeless is mentioned,
- 06:53but the computer needs to recognize
- 06:55that the individual is only homeless if
- 06:58there is a check mark next to that box.
- 07:01Next
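To make the checkbox logic concrete, here is a small sketch in Python. It assumes a text rendering in which `[X]` marks a checked box and `[ ]` an unchecked one; that syntax, the sample text, and the function name `checked_items` are assumptions for illustration, not the format of actual VA templates.

```python
import re

# Assumed checkbox syntax: "[X] label" is checked, "[ ] label" is unchecked.
CHECKBOX = re.compile(r"\[(?P<mark>[xX ])\]\s*(?P<label>[^\[\n]+)")


def checked_items(template_text):
    """Return only the labels whose box is checked."""
    return [match.group("label").strip()
            for match in CHECKBOX.finditer(template_text)
            if match.group("mark").strip()]


sample = ("Housing: [ ] Homeless  [X] Lives with family\n"
          "Symptoms: [X] Cough  [ ] Fever")
asserted = checked_items(sample)
```

"Homeless" and "Fever" both appear in the text, but neither is asserted because their boxes are empty; a plain dictionary lookup would have flagged both.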
- 07:04For slots and values, there is a
- 07:06templated request for information. Here,
- 07:08information requested includes percent
- 07:10service-connected disability and the individual's
- 07:12religion, marital status,
- 07:13living situation, etc. Responses need
- 07:15to be placed next to the request.
- 07:18So, for example, in line G,
- 07:20much as in the checkboxes,
- 07:22the computer needs to recognize
- 07:25that the individual has children
- 07:28only if a nonzero number is placed
- 07:30next to the slot for children.
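The slot-and-value rule can be sketched the same way. The example below assumes each slot sits on its own line as `G. Children: 2`, with an optional letter prefix; the layout and the helper names `parse_slots` and `has_children` are hypothetical.

```python
import re

# Assumed layout: optional "G." style prefix, slot name, colon, then the value.
SLOT = re.compile(r"^\s*(?:[A-Z]\.\s*)?(?P<slot>[^:\n]+):[ \t]*(?P<value>.*)$",
                  re.MULTILINE)


def parse_slots(text):
    """Map each lower-cased slot name to the value written next to it."""
    return {m.group("slot").strip().lower(): m.group("value").strip()
            for m in SLOT.finditer(text)}


def has_children(slots):
    # The concept is asserted only when a nonzero number fills the slot.
    value = slots.get("children", "")
    return value.isdigit() and int(value) > 0


with_children = parse_slots("F. Marital status: Married\nG. Children: 2")
no_children = parse_slots("F. Marital status: Single\nG. Children: 0")
blank_slot = parse_slots("G. Children:")
```

The word "Children" appears in all three samples, yet only the first one supports the assertion that the individual has children.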
- 07:33Next. So again,
- 07:34this part of the pipeline identifies
- 07:37templated note sections and flags
- 07:39them so that the computer can use
- 07:41the appropriate logic to identify
- 07:44the presence of symptoms. Next.
- 07:46The term identification annotator is
- 07:48the dictionary lookup portion of the
- 07:52pipeline. A dictionary of 92,000 concepts,
- 07:55or 122,000 symptom forms,
- 07:58was created from Unified Medical
- 08:01Language System, or UMLS, sources.
- 08:04Terms within this resource are tagged
- 08:07with a symptom category along with a
- 08:09set of 15 organ system subcategories.
- 08:12A dictionary of idiosyncratic symptom
- 08:14phrases and symptoms not covered by the
- 08:17symptom dictionary is also employed. Next.
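A dictionary lookup of this kind can be sketched as a greedy longest-match search over word sequences. The five entries below are a toy stand-in for the 122,000 symptom forms; the entries, the concept and organ-system labels, and the function name `lookup_terms` are illustrative assumptions rather than contents of the actual UMLS-derived resource.

```python
import re

# Toy dictionary: surface form -> (concept, organ system). Illustrative only.
SYMPTOM_DICTIONARY = {
    "shortness of breath": ("dyspnea", "respiratory"),
    "sob": ("dyspnea", "respiratory"),
    "cough": ("cough", "respiratory"),
    "loss of appetite": ("anorexia", "gastrointestinal"),
    "nausea": ("nausea", "gastrointestinal"),
}
MAX_TERM_WORDS = max(len(term.split()) for term in SYMPTOM_DICTIONARY)


def lookup_terms(text):
    """Greedy longest-match lookup of dictionary terms over word n-grams."""
    words = re.findall(r"[a-z]+", text.lower())
    hits, i = [], 0
    while i < len(words):
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in SYMPTOM_DICTIONARY:
                concept, system = SYMPTOM_DICTIONARY[candidate]
                hits.append((candidate, concept, system))
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no term starts at this word
    return hits


hits = lookup_terms("Pt c/o SOB and loss of appetite; no cough.")
```

Note that "cough" is returned even though the note says "no cough." Lookup alone only finds mentions; deciding whether a mention is asserted is left to later annotators.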
- 08:21An annotator was created specifically
- 08:23to identify potential symptoms by rules
- 08:26and patterns formed from annotations
- 08:28created by the dictionary lookup
- 08:31and document decomposition. Next.
- 08:35The context assertion annotator was
- 08:36included to identify negation,
- 08:38so "patient denies pain."
- 08:39It identifies the subject.
- 08:40So is it the patient who reports
- 08:43the symptom or someone else?
- 08:44For example, in the family
- 08:46history section of the note.
- 08:48It identifies hypotheticals.
- 08:50For example,
- 08:51many medications are prescribed PRN,
- 08:53PRN pain, or PRN dizziness.
- 08:56It also identifies whether or not
- 08:58the symptom is occurring now,
- 08:59or if it is historical,
- 09:01so something that occurred in the past.
- 09:03So a note could say something
- 09:05like "six weeks ago patient
- 09:06reported a cough." If we were only
- 09:08looking for current symptoms,
- 09:09the computer would need to
- 09:11recognize that this cough is
- 09:13not current and should not be
- 09:15flagged as a symptom of interest.
- 09:17Next
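The four checks described here (negation, subject, hypothetical, and historical) can be illustrated with a simple trigger-phrase sketch. The trigger lists are tiny illustrative assumptions, only cues that appear before the symptom in the same sentence are considered, and the function names are hypothetical; the annotator in the actual pipeline is more elaborate.

```python
import re

# Illustrative trigger phrases (assumptions); a real lexicon is far larger.
NEGATION = re.compile(r"\b(?:denies|no|not|without|negative for)\b")
HISTORICAL = re.compile(r"\b(?:ago|history of|previously)\b")
HYPOTHETICAL = re.compile(r"\b(?:prn|as needed|if)\b")
OTHER_SUBJECT = re.compile(r"\b(?:mother|father|brother|sister|family history)\b")


def assert_context(sentence, symptom):
    """Classify the context of a symptom mention from the text preceding it."""
    lowered = sentence.lower()
    position = lowered.find(symptom.lower())
    if position < 0:
        return None
    before = lowered[:position]
    return {
        "negated": bool(NEGATION.search(before)),
        "historical": bool(HISTORICAL.search(before)),
        "hypothetical": bool(HYPOTHETICAL.search(before)),
        "patient_is_subject": not OTHER_SUBJECT.search(before),
    }


def is_current_patient_symptom(sentence, symptom):
    context = assert_context(sentence, symptom)
    return (context is not None
            and not context["negated"]
            and not context["historical"]
            and not context["hypothetical"]
            and context["patient_is_subject"])


denied = assert_context("Patient denies pain.", "pain")
past = assert_context("Six weeks ago patient reported a cough.", "cough")
prn = assert_context("Acetaminophen PRN pain.", "pain")
family = assert_context("Mother had shortness of breath.", "shortness of breath")
```

A cue that follows the symptom, such as "cough: denied", would be missed by this sketch, which is one reason rule sets for clinical text tend to grow large.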
- 09:20Initially, the dictionary and rule-based
- 09:23mechanisms produced approximately nine
- 09:24false symptom mentions for each true
- 09:27symptom identified.
- 09:28An additional mechanism was needed
- 09:30to filter down the false positives.
- 09:33A tail-end annotator that employs a
- 09:35machine learning model trained on 65
- 09:38features gleaned from the upstream
- 09:40annotators was developed for this purpose.
- 09:43This model uses a support vector machine
- 09:46coupled with stochastic gradient descent
- 09:48as the classification algorithm. Next.
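Nine false mentions for each true one corresponds to a precision of 1 / (1 + 9) = 0.10 before filtering, which is why a downstream classifier was needed. The sketch below shows the general technique named here, a linear support vector machine trained by stochastic gradient descent on the regularized hinge loss. It is a toy: it uses two made-up binary features (`in_symptom_section`, `negation_cue_nearby`) instead of the 65 real ones, and the hyperparameters and function names are assumptions.

```python
import random


def train_linear_svm(samples, labels, epochs=1000, lr=0.05, reg=0.01, seed=0):
    """Minimize L2-regularized hinge loss with per-sample (stochastic) updates."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The regularizer's gradient shrinks the weights on every step.
            weights = [w * (1.0 - lr * reg) for w in weights]
            if margin < 1.0:  # hinge loss is active: move toward the correct side
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical features per candidate mention:
# (in_symptom_section, negation_cue_nearby); label +1 = true symptom.
samples = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, -1, -1, -1]
weights, bias = train_linear_svm(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

On this separable toy set the learned boundary keeps only the mention that is in a symptom section and has no negation cue, which is the filtering role the tail-end annotator plays.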
- 09:51The original performance metrics
- 09:52for the model were fairly good,
- 09:55so precision, or positive predictive value,
- 09:57was 0.8, recall, or sensitivity, was 0.7,
- 10:00and the F-measure was 0.8.
- 10:03Next. So our goal in this initial
- 10:08project is to adapt this symptom
- 10:11extraction pipeline to identify COVID-19
- 10:14symptoms in patients over time. Next.
- 10:17Our sample will include veterans
- 10:19from two well-established VA cohorts:
- 10:21the Women Veterans Cohort, or
- 10:23WVCS, and the VA Birth Cohort.
- 10:26We will include individuals who tested
- 10:29positive for COVID-19 and we will include
- 10:33all of their notes from 2 weeks before
- 10:36the diagnosis through two weeks after.
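The note-selection rule, two weeks before the diagnosis through two weeks after, amounts to a date-window filter. The sketch below assumes each note is a dictionary with a `date` field and treats both endpoints as inclusive; the data layout, the inclusivity, and the name `notes_in_window` are assumptions.

```python
from datetime import date, timedelta

WINDOW = timedelta(weeks=2)


def notes_in_window(notes, diagnosis_date, window=WINDOW):
    """Keep notes dated from two weeks before through two weeks after diagnosis."""
    start, end = diagnosis_date - window, diagnosis_date + window
    return [note for note in notes if start <= note["date"] <= end]


notes = [
    {"id": "a", "date": date(2020, 3, 1)},
    {"id": "b", "date": date(2020, 3, 20)},
    {"id": "c", "date": date(2020, 4, 1)},
    {"id": "d", "date": date(2020, 4, 20)},
]
selected = [note["id"] for note in notes_in_window(notes, date(2020, 3, 25))]
```

With a diagnosis on March 25, the window runs from March 11 through April 8, so only notes "b" and "c" are kept.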
- 10:39To give you a bit of information
- 10:41on the two cohorts:
- 10:43WVCS is a cohort of veterans identified
- 10:45from the roster of post-9/11 conflicts.
- 10:48Information from the roster is
- 10:50available and includes separation date,
- 10:52birth date, date of last deployment,
- 10:54and armed forces
- 10:55branch and component. Roster data
- 10:57have also been linked to electronic
- 10:59health record data. WVCS includes
- 11:02approximately 1.2 million individuals.
- 11:04It represents a younger cohort.
- 11:06The mean age for women was
- 11:0829 and for men 30 years.
- 11:11As is typical in the VA,
- 11:15this cohort
- 11:18is primarily male and white.
- 11:20However, it is important to remember
- 11:21that within the VA there is richer
- 11:24racial and ethnic diversity
- 11:25than in the general population,
- 11:27particularly among women. Next.
- 11:30The VA birth cohort is an EHR based cohort.
- 11:33It includes all veterans
- 11:35born between 1945 and 1965,
- 11:37so these are baby boomer veterans,
- 11:39much older than most of those in
- 11:42WVCS. The total sample size is 4.2 million.
- 11:45The age range is 55 to 75 years and
- 11:48again it is majority white and male,
- 11:51but it is important to note that even
- 11:54though women are only 15% of this cohort,
- 11:57this represents almost half a
- 12:00million women. Next.
- 12:01In terms of our sample size,
- 12:04as of May 16th at 5:41 PM,
- 12:07the cumulative number of COVID-19
- 12:08cases within the VA was
- 12:11approximately 12,000. Next.
- 12:14So how are we going to test and adapt our
- 12:17symptom pipeline? A first step will
- 12:19be to restrict the symptom dictionary
- 12:21so that the terms included are only
- 12:23those pertinent to COVID-19. Next.
- 12:27The next step is to run this
- 12:30restricted symptom extractor
- 12:31pipeline on all of the notes and to
- 12:33have clinicians review the results.
- 12:35Seven clinicians will review a
- 12:37random subset of 700 notes.
- 12:39Clinicians will first create guidelines
- 12:42for identifying positive and negative
- 12:44notes based on their clinical knowledge
- 12:47and an initial review of 100 notes.
- 12:49The guidelines will be revised
- 12:51until a kappa of 0.85 for
- 12:55inter-rater reliability is achieved.
- 12:57Each clinician will then review
- 12:59and evaluate 150
- 13:00notes out of the remaining 600
- 13:02notes so that each note is reviewed
- 13:04by at least two clinicians.
- 13:06We will then compare reviewer assessments.
- 13:08Where the two reviewers disagree,
- 13:09the PI will make the final decision. Next.
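The kappa threshold of 0.85 refers to chance-corrected agreement between reviewers. For two reviewers, Cohen's kappa is (observed agreement - expected agreement) / (1 - expected agreement). The sketch below computes it for a made-up set of ten notes; the talk does not say which kappa variant will be used with more than two reviewers, so the two-rater form and the sample labels are assumptions.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# 1 = note is positive for a symptom, 0 = negative (made-up labels).
reviewer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa(reviewer_1, reviewer_2)
needs_revision = kappa < 0.85
```

Here the reviewers agree on 9 of 10 notes, but chance agreement is 0.54, so kappa is 0.36 / 0.46, roughly 0.78. Under the plan described, that would send the guidelines back for another round of revision.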
- 13:13The third step will be to compare
- 13:16the symptoms identified by the
- 13:18pipeline with those identified by
- 13:19the clinicians in these 700 notes,
- 13:21and we're targeting precision
- 13:24and recall at 0.8. Next.
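The comparison against the clinician reference standard can be sketched as set arithmetic over (note, symptom) pairs. The pairs below are invented for illustration, and treating each (note, symptom) pair as the unit of comparison is an assumption about how the evaluation would be set up.

```python
def precision_recall_f1(pipeline_positive, clinician_positive):
    """Score pipeline output against the clinician reference standard."""
    true_pos = len(pipeline_positive & clinician_positive)
    false_pos = len(pipeline_positive - clinician_positive)
    false_neg = len(clinician_positive - pipeline_positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented (note_id, symptom) pairs.
pipeline = {(1, "cough"), (1, "fever"), (2, "cough"), (3, "anosmia"), (4, "fever")}
clinicians = {(1, "cough"), (1, "fever"), (2, "cough"), (3, "anosmia"), (5, "chills")}
precision, recall, f1 = precision_recall_f1(pipeline, clinicians)
meets_target = precision >= 0.8 and recall >= 0.8
```

Four of the five pipeline findings are confirmed and four of the five clinician findings are recovered, so precision and recall are both 4 / 5 = 0.8, which just meets the stated target.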
- 13:26If we do not achieve this goal,
- 13:28there are a number of approaches that we
- 13:30can use to improve pipeline performance.
- 13:32The first will be to augment the symptom
- 13:35terms identified by the dictionary.
- 13:36To do this,
- 13:37we will use topic modeling to identify
- 13:39relevant symptom terms in the notes.
- 13:42Topic modeling is a machine learning
- 13:44technique that can be applied to
- 13:46large corpora to discover themes,
- 13:48i.e., symptom topics, that are
- 13:50semantically related.
- 13:51We can pretrain a Bidirectional
- 13:54Encoder Representations from
- 13:55Transformers, or BERT, model on 10,000
- 13:58documents with keywords to boost the
- 14:00NLP's ability to recognize synonyms,
- 14:02related terms, and misspellings.
- 14:04Finally,
- 14:04we can target the machine learning
- 14:07component of the pipeline and train
- 14:10and test support vector machine models
- 14:12with different configurations. Next.
- 14:15We're applying for funding for this
- 14:17project from the VA rapid response project
- 14:20calls. We're also submitting a proposal
- 14:22in response to YSN's call
- 14:25for intramural pilot grants. Next.
- 14:27Once we have adapted the pipeline
- 14:29to accurately identify COVID-19
- 14:31symptoms in VA EHR text notes,
- 14:33there are a number of projects that
- 14:36we are interested in pursuing. Next.
- 14:39The first project will focus on
- 14:41evaluating the risk of infection
- 14:43and death associated with SARS-CoV-2
- 14:45and influenza in the six months
- 14:48following the index infection with COVID-19.
- 14:51So COVID-19 will be defined as a
- 14:53positive PCR collected at least eight
- 14:55weeks after the index infection
- 14:58and by the presence of symptoms.
- 15:00This project is led by Doctor Rupak Datta,
- 15:03an instructor in Infectious Diseases
- 15:05at the West Haven VA and at Yale New Haven.
- 15:08His mentors include Doctors
- 15:09Kathleen Akgün,
- 15:10Cynthia Brandt, and Amy Justice. Next.
- 15:14We're also interested in looking at
- 15:16symptoms versus symptom clusters,
- 15:18and their associations with COVID-19
- 15:19testing and seropositivity.
- 15:21In particular,
- 15:22we are interested in exploring whether
- 15:24symptoms or symptom clusters differ by age,
- 15:26sex,
- 15:27race, and VA region. I'm the PI
- 15:29on this project,
- 15:31and I'm working with Doctors Akgün,
- 15:34Brandt, and Justice. Next.
- 15:37Additional projects include
- 15:39validating an approach to identifying
- 15:41COVID-19 infection in VA data for
- 15:43research and QI purposes that includes
- 15:45the combination of symptoms
- 15:47or symptom clusters,
- 15:48results of chest radiographs or CT scans,
- 15:51and PCR testing. We're also interested
- 15:53in exploring whether or not we can
- 15:56use the adapted symptom extractor
- 15:57as the foundation for an EHR-based
- 16:00biosurveillance system to identify
- 16:03the onset of new COVID-19
- 16:05surges. We're interested in seeing
- 16:07whether or not this symptom extractor
- 16:09can be adapted to other electronic
- 16:11health records, such as Epic,
- 16:13and to other electronic data
- 16:15sources, such as Google.
- 16:17Finally, we're interested in
- 16:19looking at associations between
- 16:20symptoms and symptom clusters
- 16:22with COVID-19 viral load. Next.
- 16:26All the work that I've described
- 16:28is the product of team science.
- 16:30Members of the team are from Yale,
- 16:32the School of Nursing
- 16:33and the School of Medicine,
- 16:35George Washington University, and OHSU. Next.
- 16:37Thank you much.
- 16:38Thank you very much for your time.
- 16:46Thank you very much.