Nestled in our genomes are tiny sequences with immense power to control nearby genes.
Known as cis-regulatory elements, or CREs, these DNA sequences can switch neighboring genes on or off.
Now, researchers at Yale School of Medicine (YSM), the Jackson Laboratory, and the Broad Institute of MIT and Harvard have developed a new generative AI method to design never-before-seen regulatory elements that precisely control how genes are switched on, or expressed, in cells. The AI-designed synthetic DNA can switch on genes only in specific kinds of cells in the body.
The researchers describe the AI platform, known as Computational Optimization of DNA Activity, or CODA, in an article published in the journal Nature on October 23.
Controlling how genes are expressed in certain kinds of cells could one day greatly improve gene therapy. This potentially curative method holds the promise to rewrite disease-causing mutations, but better methods are needed to deliver the therapies directly to cells that harbor the disease—for example, the specific kinds of neurons that fail in Parkinson’s disease or the immune cells that harbor HIV.
CODA, the newly designed AI platform, could one day help to bring gene therapy to diseased cells in a more targeted manner, leaving the therapy inactive in healthy parts of the body where it might otherwise cause harm. Some early iterations of experimental gene therapies failed to progress to clinical use because of these damaging off-target effects. Ultimately, CODA’s designers hope to use the method to develop targeted gene therapies for brain disorders, metabolic diseases, and blood disorders.
Beyond human capabilities
“This project essentially asks the question: ‘Can we learn to read and write the code of these regulatory elements?’” said Steven Reilly, PhD, assistant professor of genetics at YSM and one of the senior authors of the study. “If we think about it in terms of language, the grammar and syntax of these elements is poorly understood. And so, we tried to build machine learning methods that could learn a more complex code than we could do on our own.”
That complex code stands in contrast to the language of our genes, which is written in a fairly simple cipher, cracked many decades ago. Each three-letter string in the sequence of a gene translates to a specific amino acid, one of the building blocks of proteins. With only 64 distinct three-letter combinations, the language of genes is not difficult to learn.
But not so with regulatory elements, which are part of the nearly 99% of the human genome that does not code for proteins. These regulatory sequences don’t seem to follow a simple code, at least not one that humans can easily discern. And the space of potential DNA sequences that could make up these elements is massive: For an average-sized regulatory element, the number of possible DNA sequences is larger than the number of atoms in the known universe, Reilly said.
“All the computers in the world couldn’t search through every possible sequence combination, so you have to figure out a smart way to go through it,” he said.
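The scale of that comparison can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 200-letter element length and the figure of roughly 10^80 atoms are assumptions chosen for the example, not numbers given by the researchers.

```python
# Count the possible DNA sequences of a given length (4 letters per position).
BASES = "ACGT"


def sequence_space(length: int) -> int:
    """Return the number of distinct DNA sequences of the given length."""
    return len(BASES) ** length


codons = sequence_space(3)      # three-letter strings used in genes
element = sequence_space(200)   # ASSUMPTION: a 200-letter regulatory element
atoms_estimate = 10 ** 80       # ASSUMPTION: rough order-of-magnitude estimate

print(codons)                   # 64
print(len(str(element)) - 1)    # power of ten of the element count: 120
print(element > atoms_estimate) # True
```

Under these assumptions, the gene code has 64 combinations to learn, while a single regulatory element has on the order of 10^120 possible sequences, far too many to test one by one.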
Machine learning approaches only recently available
Such a massive problem requires computational approaches only recently available through deep learning, a form of artificial intelligence that the researchers used to generate new DNA sequences. Similar to generative AI approaches that underlie well-known tools such as DALL-E and ChatGPT, CODA can create new CREs based on the data it was trained on.
Pardis Sabeti, MD, DPhil, co-senior author of the study and a core institute member at the Broad Institute and professor at Harvard, said the new technologies have extraordinary potential. “By applying machine learning and molecular biology to the logic of when and where CREs work, we can leverage that knowledge using generative AI to build tools for modulating gene expression in new ways experimentally and, perhaps one day, therapeutically,” Sabeti said.
This study involved complex work, and more will now follow. “Combining computational models with large-scale experimental approaches is a powerful strategy,” said Ryan Tewhey, PhD, associate professor at the Jackson Laboratory and co-senior author of the study. “However, the models are only as good as the data they learn from. By validating findings, we can quickly identify where improvements can be made.”
The scientists trained their AI model, CODA, on data from naturally occurring regulatory elements so it could iterate on DNA sequences that already work rather than sorting through every possible sequence. They used measurements of the activity of more than 775,000 different regulatory elements in human blood, liver, and brain cells grown in the lab. Regulatory elements can determine whether, or by how much, a gene is switched on or off, like molecular tuning knobs for our genes. And the elements themselves are often active only in a given cell type, such as a liver cell, meaning the genes they influence would be turned on only in that kind of cell.
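The general idea of iterating on a sequence under the guidance of a predictive model, instead of enumerating every possibility, can be sketched in a few lines. This is a toy illustration and not CODA's actual algorithm: the function predict_activity is a hypothetical stand-in for a trained model, and the short patterns it counts are arbitrary placeholders with no biological meaning.

```python
import random

BASES = "ACGT"

# HYPOTHETICAL stand-in for a trained model: each cell type "responds" to an
# arbitrary placeholder pattern. These patterns are made up for illustration.
TOY_PATTERNS = {"blood": "AAAC", "liver": "CCGT", "brain": "GGTA"}


def predict_activity(seq: str) -> dict:
    """Toy predicted activity per cell type: count of that type's pattern."""
    return {cell: seq.count(pattern) for cell, pattern in TOY_PATTERNS.items()}


def specificity(seq: str, target: str) -> int:
    """Activity in the target cell type minus the highest off-target activity."""
    activity = predict_activity(seq)
    off_target = max(v for cell, v in activity.items() if cell != target)
    return activity[target] - off_target


def optimize(seq: str, target: str, steps: int = 2000, seed: int = 0):
    """Simple hill climbing: try single-letter changes, keep those that do not hurt."""
    rng = random.Random(seed)
    best, best_score = seq, specificity(seq, target)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(BASES) + best[pos + 1:]
        score = specificity(candidate, target)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


start = "ACGT" * 10  # arbitrary starting sequence, not a real regulatory element
designed, score = optimize(start, "liver")
print(specificity(start, "liver"), "->", score)
```

The sketch scores each candidate by how much more active it is predicted to be in the target cell type than in the others, which mirrors the goal of cell-type specificity described here. A real system would replace the toy scoring function with a model trained on measured activity and would likely use a more sophisticated search than single-letter hill climbing.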
Pinpointing specific target cells
The scientists tested the AI-designed regulatory elements in these same three kinds of cells and found that in many cases, the synthetic elements were actually more specific for a given cell type than any of our naturally occurring sequences. They then tested a subset of these synthetic elements in living zebrafish and mice and found that the sequences also worked to switch on test genes in specific cell types in the live animals. In one case, an AI-designed regulatory element switched on a reporter gene only in a very specific layer of cells in the mouse brain, despite being delivered everywhere in the animal’s body.
“We were impressed by how effectively CODA-designed sequences achieved cell-type specificity,” said Rodrigo Castro, PhD, a computational scientist at the Jackson Laboratory and co-first author of the paper.
Next, the researchers are planning to use different kinds of cells to develop regulatory elements specific to even more cell types. And they’re also planning to combine the AI-designed elements with other pieces of technology necessary for gene therapy, starting with certain diseases of the brain, metabolism, or blood. In theory, the approach could be used for any kind of genetic disease, Reilly said.
Sager Gosai, PhD, co-first author on the study and a postdoctoral fellow in Sabeti’s lab at the Broad Institute, said this method may outdo human evolution as a means to treat disease. “Natural CREs, while plentiful, represent a tiny fraction of possible genetic elements and are constrained in their function by natural selection,” Gosai said.
Reilly agreed.
“There are a lot of potential solutions out there for lots of different possible things you might want a regulatory element to do,” Reilly said. “Evolution maybe has never wanted to build a really great driver for an Alzheimer’s drug, but that doesn’t mean it can’t exist.”
----
This work was supported by the Howard Hughes Medical Institute and by US National Institutes of Health grants UM1HG009435, R00HG010669, R01HG012872, and R35HG011329.