The Role of Machine Learning Algorithms in Biomedical Discovery

May 10, 2022
by Elisabeth Reitman

Machine learning algorithms accelerate biomedical research in viral infection, cardiovascular disease, breast cancer and more at Yale’s van Dijk Lab.

David van Dijk, PhD, uses machine learning algorithms that analyze complex biomedical data. A computer scientist by training, van Dijk holds a dual appointment in medicine and computer science at Yale, where he uses graph signal processing and deep learning to find patterns in large data sets.

Launched in September 2019, the van Dijk Lab uses algorithms to accelerate discoveries in medicine. The lab develops new computational methods, based on machine learning, and applies these to large data sets to advance our understanding of a wide range of biological systems and diseases.

Van Dijk never anticipated that he would work in cardiovascular research. As a student, his interests evolved from computer science theory to application in a variety of fields. “Computer science can be a vehicle to understand the world, whether it’s biology, medicine, or sociology,” he explains. After graduating from college, van Dijk went on to earn a master’s and doctorate in computer science, at which point his focus shifted toward computational biology.

Insights into DNA Sequences from Machine Learning Algorithms

At the Weizmann Institute of Science in Israel, van Dijk worked with computational scientist Eran Segal, PhD, to investigate variability in gene expression, a process where DNA sequences enable genetic information to be read by the cell. Van Dijk co-developed a model to understand how promoter DNA sequences impart codes that determine whether a gene should be active or inactive.

Segal and van Dijk created advanced machine learning algorithms to work with the large amounts of data encoded in promoter regions. Using yeast data collected in the lab, van Dijk designed promoter regions and mutated them to create a large data set. The researchers then used the algorithm to find patterns that predicted the activity of genes based on these DNA sequences.
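To make the idea of predicting gene activity from sequence concrete, here is a minimal sketch in Python. It is not the model Segal and van Dijk built; it assumes a much simpler stand-in, in which each promoter is summarized by counts of short DNA words (k-mers) and a ridge regression maps those counts to activity. The sequences, activity values, and function names are invented for illustration.

```python
from itertools import product

import numpy as np


def kmer_counts(seq, k=2):
    """Count every possible DNA k-mer in seq, in a fixed ACGT ordering."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        counts[index[seq[i:i + k]]] += 1
    return counts


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Hypothetical toy data: made-up promoter sequences and activity values.
promoters = ["ACGTACGTAA", "TTTTACGTAC", "GGGGCCCCAA", "ACACACACAC"]
activity = np.array([1.2, 0.8, 0.1, 0.5])

X = np.array([kmer_counts(s) for s in promoters])  # one row per promoter
weights = fit_ridge(X, activity)                   # one weight per k-mer
predicted = X @ weights                            # fitted activity per promoter
```

In this sketch, a large positive weight would suggest that a k-mer tends to appear in more active promoters. A real study would need far more sequences and held-out data to check that such patterns generalize.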

At the time, many scientists were measuring the expression or the activity of genes at the tissue level. For his post-doc research, van Dijk wanted to conduct experiments to see if gene expression could make predictions at the cellular level. The opportunity came in 2015 when van Dijk accepted a position at Columbia University.

Single-cell RNA sequencing was becoming more widely accepted. This experimental technology could be used to learn, at high throughput, which genes are expressed in individual cells. RNA sequencing from single cells could answer research questions on topics such as stem cell maturation, cancer heterogeneity, and variability within complex tissues.

For example, individual cells have diverse expression patterns that are hidden when expression is measured “on average” or “in bulk.” Scientists use single-cell RNA sequencing to measure cell activity in a tumor and understand the complexity of the tumor. With these insights into cell development, scientists could generate large data sets, but the technology had two limits. First, the data lacked structure. Second, the methods used to gather data were inefficient, and critical information was often lost. Van Dijk realized that this was the perfect machine learning problem.
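One generic way to compensate for lost measurements is to borrow information from similar cells. The Python sketch below illustrates that idea under simple assumptions: each cell's expression profile is replaced by the average over itself and its nearest neighbors. It is an illustration only, not the lab's algorithm, and the function name and toy matrix are hypothetical.

```python
import numpy as np


def knn_smooth(expr, k=1):
    """Average each cell's profile with those of its k nearest cells.

    expr is a (cells x genes) matrix; returns a matrix of the same shape.
    """
    # Pairwise squared Euclidean distances between cells (rows).
    diff = expr[:, None, :] - expr[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    # Each cell is at distance 0 from itself, so the k + 1 closest
    # entries are normally the cell plus its k nearest neighbors.
    neighbors = np.argsort(dist, axis=1)[:, :k + 1]
    return expr[neighbors].mean(axis=1)


# Hypothetical toy matrix: 4 cells x 3 genes, two similar pairs of cells.
expr = np.array([
    [5.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
    [0.0, 6.0, 5.0],
    [1.0, 5.0, 6.0],
])
smoothed = knn_smooth(expr, k=1)
print(smoothed[0])  # first cell averaged with its nearest neighbor
```

Here the zero reading for the second gene in the first cell is pulled toward the value seen in its closest neighbor. Simple averaging like this can also blur real differences between cells, which is one reason more careful methods are needed.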

For decades, MIT’s Robert Weinberg has contributed to the characterization of human cancer genes. Van Dijk and Weinberg developed an algorithm that they applied to a breast cancer model used to measure the spread of cancer cells to new areas of the body. They discovered that when cells transition from their baseline epithelial phenotype into a metastatic mesenchymal phenotype, the process is associated with certain stem cell signatures. The cells become stem-like before transitioning to this more metastatic state, a subtle change that could lead to breakthroughs in cancer research.

From Large Data Sets to Improved Patient Care

The end goal of this research is not only to provide more precise diagnoses and treatments; scientists also hope that machine learning algorithms will enable physicians to spend more time with patients. The technology exists, but it has not been deployed in the healthcare field until recently.

In September 2019, Yale launched a comprehensive DNA sequencing project called Generations. Progress is slow, however, for several reasons. Outdated healthcare systems were not designed with machine learning in mind, so it can be challenging to collect large amounts of data. The data may also contain biases, inconsistencies, and incomplete information.

The machine learning algorithms that van Dijk developed can be applied to a variety of scenarios, whether in molecular biology or clinical data. On a given day, van Dijk could be working with clinicians to address important questions about health records or designing an experiment to answer a fundamental question about molecular biology. One goal is to use health records data collected at Yale New Haven Hospital and relate it to patient outcomes. For example, van Dijk believes heart failure is one area that would benefit from this approach. Van Dijk also co-authored a paper in Nature Methods, in which Yale scientists used a neural network called SAUCIE to analyze 11 million cells, revealing cellular differences within individuals as well as broader patterns in how the body functions.

Currently, the van Dijk Lab is collaborating with Yale colleagues on several projects. “Everyone is excited to collaborate,” he says. “I get exposed to so many interesting systems, and I have the opportunity to impact healthcare and medicine.”

Biomedical Imaging and Cardiovascular Research

Van Dijk is working with nuclear cardiologists to develop the first machine learning algorithm with the ability to analyze 3D images for new phenotypes.

A positron emission tomography (PET) scan is an imaging test that helps reveal areas of decreased blood flow to the heart. Every day, dozens of patients at Yale New Haven Hospital with severe chest pain receive a stress test to track the function of the heart in 3D. A PET scan helps clinicians determine how well the heart is functioning and whether a patient may need invasive treatment. Van Dijk is now working with experts to extract outcomes information, and then leverage that data for more accurate diagnoses and to determine which patients benefited the most from a surgical procedure.

“The idea here is that perhaps we’re not maximizing the information we get. If we can extract additional information that perhaps you traditionally wouldn't have looked for, there may be very subtle information,” he said.

What’s challenging about this, he explains, is figuring out how an algorithm can ingest a 3D image and extract meaningful information from it. Eventually, van Dijk hopes to combine health records data with the nuclear imaging data.

Van Dijk is also developing a research project with vascular biologist Stefania Nicoli, PhD, co-director of the Yale Cardiovascular Research Center (YCVRC). “There's a lot of excitement here,” he says. “The atmosphere has been really great and welcoming.” The partnership hopes to use machine learning to predict complex brain vascular patterns, which will provide new insights around how the cardiovascular system is shaped by genetic activity.

Demystifying Immune System-Gut Microbiota Interactions

In addition to his research with the YCVRC, van Dijk is also a collaborator on several immunology projects that could impact cardiovascular health.

The van Dijk Lab is collaborating with David Hafler, MD, the William S. and Lois Stiles Edgerly Professor of Neurology and professor of immunobiology, to generate a data set of immune cells to locate signatures of the homeostatic immune system in the cerebrospinal fluid to predict multiple sclerosis (MS). Hafler, who is also neurologist-in-chief at Yale New Haven Hospital, is widely recognized for his contributions in identifying the underlying causes of MS.

Other collaborators include Noah Palm, PhD, and Aaron Ring, MD, PhD, both assistant professors of immunobiology. Palm’s research is focused on the complex interactions between the immune system and the gut microbiota. Together, van Dijk and Palm are investigating gut microbiome-host interactions.

Palm has developed an experimental technology that can measure how microbes interact with our immune system. Van Dijk’s goal is to develop a model to identify signatures in Palm’s model. Ring hopes to understand and manipulate the activity of immune receptors using structural and combinatorial biology approaches. If successful, Ring’s model could measure all of the antibody reactivity to predict outcomes of cancer immunotherapy.

Identifying Biomarkers of Viral Infection

In November 2021, van Dijk received a five-year $1.25 million MIRA R35 grant from the National Institute of General Medical Sciences (NIGMS) for a new project, “An integrative, data-driven, and computational approach to uncovering dynamic mechanisms of early viral infection.”

The proposal aims to develop data-centered, computational approaches to identify distinct biomarkers of viral infection.

Technology holds great promise for the treatment of disease. To generate new therapies, scientists need to synthesize biological data sets to identify patterns in viral particles. Van Dijk stands at the forefront of a new technique known as single-cell RNA-sequencing or scRNA-seq, which delivers valuable information about the genes active in thousands of individual cells.

This new proposal expands on an earlier project, co-led by van Dijk and Craig Wilen, MD, PhD, an assistant professor of laboratory medicine and of immunobiology, on single-cell analysis of SARS-CoV-2 infection dynamics to determine how the SARS-CoV-2 virus infects and alters healthy cells.

The NIGMS-funded study will enable the van Dijk Lab to use biomedical machine learning algorithms to reveal the genetic determinants of viral infection—and improve our ability to predict and manage the progression of disease. Using these insights, the lab will focus on generating new hypotheses for a variety of disease backgrounds.

“This proposal promises to build approaches applicable across biological systems and processes, change our mechanistic understanding of viral infection, and, in the future, support therapeutic design,” said van Dijk.

This article was originally published on May 21, 2021; updated May 10, 2022.

Submitted by Elisabeth Reitman on May 21, 2020