R.I.P., junk DNA: not the DNA as such, but the moniker that has described it in a misleading fashion for years. Scientists have long known that vast swaths of the human genome don’t produce proteins. They have also known that these sections are nonetheless active. How much of the genome produces proteins was not known until the first draft of the Human Genome Project, released in 2000, tallied the coding regions of the genome. Only about 1 percent—roughly 21,000 genes—codes for proteins. And the other 99 percent?

The National Human Genome Research Institute (NHGRI) began a follow-up to the Human Genome Project in 2003. With a budget of $288 million, the Encyclopedia of DNA Elements (ENCODE) would map that 99 percent and catalogue its functional elements for a better understanding of the genome and its role in human biology and disease. ENCODE enlisted 440 researchers at 32 institutions in the United States, the United Kingdom, Spain, Singapore, and Japan, who communicated via wikis, Google docs, and two face-to-face meetings each year. The researchers began with a pilot project that would study just 1 percent of the genome while gauging research methods and technologies. Their findings, published in Nature and Genome Research in 2007, showed that the project could identify and characterize functional elements in the genome. In the next phase, the consortium went beyond the initial 1 percent and covered the whole genome by studying 147 cell types and performing more than 1,600 experiments.

In September the findings from those experiments were published in such journals as Nature, Genome Research, and Genome Biology. This research, ENCODE announced, “gives the first holistic view of how the human genome actually does its job.”

The consortium found biological activity in 80 percent of the genome and identified about 4 million sites that play a role in regulating genes. Some noncoding sections, as had long been known, regulate genes. Some noncoding regions bind regulatory proteins, while others code for strands of RNA that regulate gene expression. Yale scientists, who played a key role in this project, also found “fossils,” genes that date to our nonhuman ancestors and may still have a function. Mark B. Gerstein, Ph.D., the Albert L. Williams Professor of Biomedical Informatics and professor of molecular biophysics and biochemistry, and computer science, led a team that unraveled the network of connections between coding and noncoding sections of the genome.

Arguably the project’s greatest achievement is the repository of new information that will give scientists a stronger grasp of human biology and disease, and pave the way for novel medical treatments. Once verified for accuracy, the data sets generated by the project are posted on the Internet, available to anyone. Even before the project’s September announcement, more than 150 scientists not connected to ENCODE had used its data in their research.

“We’ve come a long way,” said Ewan Birney, Ph.D., of the European Bioinformatics Institute (EBI) in the United Kingdom, lead analysis coordinator for ENCODE. “By carefully piecing together a simply staggering variety of data, we’ve shown that the human genome is simply alive with switches, turning our genes on and off and controlling when and where proteins are produced. ENCODE has taken our knowledge of the genome to the next level, and all of that knowledge is being shared openly.”

Big data, big questions

The day in September that the news embargo on the ENCODE project’s findings was lifted, Gerstein saw an article about the project in The New York Times on his smartphone. There was a problem. A graphic hadn’t been reproduced accurately. “I was just so panicked,” he recalled. “I was literally walking around Sterling Hall of Medicine between meetings talking with The Times on the phone.” He finally reached a graphics editor who fixed it.

An academic whose scholarly and personal interests focus on information and how we make sense of it, Gerstein had run up against the juggernaut of the 24-hour news cycle. But in the end, he helped The New York Times get it right, just as he’d played a role in helping the international consortium of ENCODE scientists interpret the vast expanse of data that they uncovered. The concept of “big data,” an amount of information so large that it challenges efforts to store and use it, is key to ENCODE and to Gerstein’s work generally. “It’s really a very transformative idea in terms of how people approach experiments and how people think about analyzing things,” he said. He likened a rich data resource to a great piece of literature, “something that’s kind of transcendent and speaks to many different people.” It can inspire and answer many questions. “I do think that particularly for genomic data sets,” he said, “the value of the data set goes beyond the initial question.”

Given the new availability of incredibly large data sets, a scientific supergroup with high levels of collegiality and collaboration was essential to the success of ENCODE. Having one group carry out the project allowed for a uniformity of method and reporting that was critical, said Michael Pazin, Ph.D., program director in functional genomics at NHGRI. Imagine the confusion caused by a map where thick blue lines sometimes represent interstate highways and other times rivers. But there is ample room for small projects to emerge based on the availability of the new resource, added Elise Feingold, Ph.D. ’86, program director in genome analysis at NHGRI. “I don’t think (consortia are) ever going to substitute for the individual researcher and these small collaborations,” she said.

Gerstein took to the collaborative process, according to Birney. “Mark likes to find a scenario where everyone gets along without compromising the science. This is not always as easy as it sounds, and takes some effort talking to people. Like all of us, Mark has some characteristic phrases, and I would always know that Mark didn’t quite agree on something when he would start, ‘Wouldn’t you say, Ewan, that ...’, and then he’d be into some point,” Birney said.

New technology paves the way

ENCODE would have been unthinkable without the technology and methodology to gather, store, and analyze enormous data sets. When Gerstein began his career, things were different. He’d majored in physics and wanted to pursue a science that was driven by advances in computer technology. But there was no clear pathway to do that. He completed his doctorate at Cambridge, which is now home to the EBI. “There was no EBI,” recalled Gerstein. “There was no program in bioinformatics. I did a program in chemistry.” He wondered whether he’d stay in academia because most universities did not have even a single bioinformatics position.

In 1996, however, Donald Engelman, Ph.D., the Eugene Higgins Professor of Molecular Biophysics and Biochemistry, saw the need for computational expertise at Yale and recruited Gerstein. “I and others in the department were concerned about computation and its role in research,” remembered Engelman, who was not involved in ENCODE. “There would be an enormous explosion of information to deal with as genetic information became available and more structural information became available. Someone who can use those enormous databases is key.”

In those days, though, the tools for uncovering those data were still being developed. When Valerie Reinke, Ph.D., associate professor of genetics, was in college, she’d often draw diagrams of cells on cocktail napkins to illustrate points to her friends who were not science majors. “It always amazes me that there are people who don’t want to know how their bodies work,” she said. Reinke was part of the modENCODE project, which focused on functional element identification similar to ENCODE, only in such model organisms as the fruit fly Drosophila melanogaster and the roundworm Caenorhabditis elegans. Reinke specializes in the roundworm, which shares many genes with humans. About a fifth of the worm’s genome codes for proteins, making it easy to identify noncoding functional elements. The tools for discovering the fine details of what was happening in those sketches she drew in college were still evolving when she was a student. By 2000, when she joined the Yale faculty, microarray technology, which allows scientists to analyze expression of multiple genes in a single experiment, was brand-new. As with personal computing, DNA sequencing technology has rapidly grown more powerful, faster, and cheaper.

In 2007, as the ENCODE pilot project was ending and the next phase was getting started, next-generation sequencing technology became available. “That was really a remarkable confluence of events that we were able to take advantage of and was really a game changer for the project,” remembered Feingold.

The evolution of the technology is making it practical to look at genetics on an individual level, said Reinke, where information could be used to formulate treatments tailored to a particular patient. “We haven’t even begun to scratch the surface,” she said. “There are so many questions.”

One thing is clear. ENCODE will have profound implications for personal genomics. Each of us gets a double set of genes, with one copy, or allele, coming from each parent. Being able to determine allele-specific expression “brought home the idea of what you might call a personal annotation,” Gerstein said. “We think that this personal annotation is the next phase for genomics.”

This personal annotation, notes Gerstein, can raise ethical issues. Would you want an employer or health insurer to know about your susceptibility to a degenerative illness? These kinds of questions don’t stop at the molecular level: Foursquare lets the world know in which Starbucks you’re enjoying a coffee, and a friend can share on Facebook a picture from the office holiday party that you’d rather never saw the light of day. “I do think a big aspect of information technology, both big data and computing, is this erosion of privacy,” said Gerstein.

The myth of junk DNA

Some early press coverage credited ENCODE with discovering that so-called junk DNA has a function, but that was old news. The term had been floating around since the 1970s and suggested that the bulk of noncoding DNA serves no purpose; however, articles in scholarly journals had reported for decades that DNA in these “junk” regions does play a regulatory role. In a 2007 issue of Genome Research, Gerstein had suggested that the ENCODE project might prompt a new definition of what a gene is, based on “the discrepancy between our previous protein-centric view of the gene and one that is revealed by the extensive transcriptional activity of the genome.” Researchers had known for some time that the noncoding regions are alive with activity. ENCODE demonstrated just how much action there is and defined what is happening in 80 percent of the genome. That is not to say that 80 percent was found to have a regulatory function, only that some biochemical activity is going on. The space between genes was also found to contain sites where DNA transcription into RNA begins and areas that encode RNA transcripts that might have regulatory roles even though they are not translated into proteins.

But helping people grasp the massive import of ENCODE proved a challenge. “People don’t think that creating a resource is a sexy endeavor,” said Feingold.

“It’s so easy to either overpromise or undersell,” agreed Pazin. On the one hand, he did not want to make claims that ENCODE would quickly lead to cures for diseases like cancer. On the other, he didn’t want the public to ignore the discovery because it was too technical to understand. That’s why the “useful shorthand” of junk DNA so often came up in coverage, said Feingold.

Hopefully, ENCODE will help put an end to the notion of junk DNA. The project not only assigned general classes of functions to areas of the genome but also showed the complexity of how those areas interact. The project revealed the genome’s organizational hierarchy, with top-level regulators wielding vast influence while “middle managers” often have to collaborate to regulate genes.

There was no “Eureka!” moment, said Gerstein, who called the findings “the opposite of a discovery.” Instead, there were years of gathering and interrogating data to create a map of the vast majority of the genome. As many researchers have found, these noncoding regions are alive with regulatory activity that plays a critical role in human disease, though some of the functioning that was documented did not have such obvious applications.

His team, Gerstein said, took a different path from those of others involved in the project.

“Most of the project is more oriented on annotating elements,” he said. “Our unique perspective was to make it a network.”

If it were simply a genetic encyclopedia, ENCODE would catalogue its entries in isolation from one another. The Abaco Islands reside next to abacus in a conventional encyclopedia because that’s how the words fall alphabetically—not because the topics identified by the words have any intrinsic relationship. Knowing how different parts of the genome work together is far more powerful than simply compiling a parts list.

Through computational analysis, Gerstein’s lab broke apart the “hairball” of the regulatory networks to find working relationships. He developed statistical models that identified regulators located far away from the genes they influence. He found that the way the human genome is organized is not so different from the way humans organize themselves. Gerstein likens transcription factors that have considerable regulatory influence to top-level managers. As might be the case with their human analogues, these elite transcription factors tend to be conservative.

“What does conservative mean? Conservative means they’re more preserved. There’s less variation,” said Gerstein. “It’s sort of natural that in that kind of context, you don’t want them to change as much.”

The less influential transcription factors, which he terms “middle managers,” are less conservative and more likely to work cooperatively than their peers. Often these middle managers will co-regulate a gene, easing the flow of information in what would otherwise be “a bottleneck.”

There is less interaction between the top-level transcription factors and the bottom-level, least influential transcription factors than one would expect to happen by chance. The human genome is not egalitarian.
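The comparison with chance can be made concrete with a toy calculation. The sketch below is only an illustration, not the statistical model Gerstein’s lab used: the factor names, levels, and links are invented, and “chance” is assumed to mean that every possible regulator-to-target link is equally likely.

```python
from itertools import permutations

# Hypothetical toy network: every name, level, and link here is invented.
LEVEL = {
    "T1": "top", "T2": "top",
    "M1": "middle", "M2": "middle",
    "B1": "bottom", "B2": "bottom",
}

# Directed links (regulator, target).
EDGES = [
    ("T1", "M1"), ("T1", "M2"), ("T2", "M1"), ("T2", "M2"),
    ("M1", "B1"), ("M1", "B2"), ("M2", "B1"), ("M2", "B2"),
    ("T1", "B1"),
]


def observed_top_to_bottom(edges, level):
    """Count links that run directly from a top-level to a bottom-level factor."""
    return sum(1 for src, dst in edges
               if level[src] == "top" and level[dst] == "bottom")


def expected_top_to_bottom(edges, level):
    """Expected count if the same number of links were spread uniformly over
    all ordered pairs of distinct factors (an assumed null model)."""
    pairs = list(permutations(level, 2))
    eligible = sum(1 for src, dst in pairs
                   if level[src] == "top" and level[dst] == "bottom")
    return len(edges) * eligible / len(pairs)


if __name__ == "__main__":
    print("observed:", observed_top_to_bottom(EDGES, LEVEL))
    print("expected by chance:", expected_top_to_bottom(EDGES, LEVEL))
```

In this made-up network the observed count of direct top-to-bottom links falls below the chance expectation, which is the pattern the article describes; a real analysis would need a more careful null model and a significance test.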

Gerstein and colleagues at the Sanger Center, the University of California at Santa Cruz, and Cold Spring Harbor Laboratory on Long Island also found about 12,000 pseudogenes—fossil genes dating back to our nonhuman ancestors—which at first glance appear to be dead. But it turns out that some pseudogenes, while they no longer code for proteins, are quite animated. “They’re very much on the edge between living and dead,” said Gerstein.

In some people, these pseudogenes are turned into actual genes. “What’s going on here? Is this a gene that’s being born?” he asked. Pseudogenes open a window on the history of our species. Some of these fossil genes may still be players in the regulatory network.

The impact on medicine

What will ENCODE mean for human health, and how soon will this genomic encyclopedia inform treatment? There is no easy answer to that critical question, according to Sherman Weissman, M.D., Sterling Professor of Genetics.

“I grew up with the field,” said Weissman, whose research interests include genome-wide mapping of gene activity and chromosome structure in humans. Weissman contributed to ENCODE through collaborations with former Yale professor Michael Snyder, Ph.D., a leader in the field of functional genomics who is now at Stanford.

Knowing the molecular basis of a disease carries no guarantee that a cure is imminent. Linus Pauling, Ph.D., linked sickle cell disease to an abnormal protein in 1949, making it the first genetic disease for which the molecular basis was known. But, Weissman noted, there is still no cure for it. On the other hand, survival rates for chronic myelogenous leukemia are improving thanks to Gleevec, a drug based on oncogene study that received FDA approval in 2001. Weissman is optimistic that genetic information could lead to effective treatments for Alzheimer disease, which he terms “simpler than cancer.”

“We have so much data, and a very large part of it hasn’t been fully exploited,” he said. “We’re really bumping up against the ceiling in some practical ways.”

One of the project’s findings is that genetic changes linked to disease occur between genes in places where ENCODE has identified regulatory sites. It’s still not clear how variations in these areas contribute to disease. “Some people were surprised,” said Pazin, “that disease-linked genetic variants are not usually in protein-coding regions. We expect to find that many genetic changes causing a disorder are within regulatory regions, or switches, that affect how much protein is produced or when the protein is produced, rather than affecting the structure of the protein itself. The medical condition will occur because the gene is aberrantly turned on or turned off or abnormal amounts of the protein are made. Far from being junk DNA, this regulatory DNA clearly makes important contributions to human health and disease.”
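The kind of bookkeeping behind that observation, checking whether a disease-linked variant falls in a protein-coding region or a regulatory one, can be sketched in a few lines. Everything below is hypothetical: the coordinates, the region lists, and the function names are invented for illustration and do not reflect ENCODE’s actual data formats or pipelines.

```python
from collections import Counter

# Hypothetical positions on a single chromosome; intervals are half-open [start, end).
CODING = [(100, 200), (500, 650)]
REGULATORY = [(50, 90), (300, 340), (700, 760)]
VARIANTS = [60, 150, 310, 420, 720]


def classify(position, coding, regulatory):
    """Label a variant by the first annotation class that contains it."""
    if any(start <= position < end for start, end in coding):
        return "coding"
    if any(start <= position < end for start, end in regulatory):
        return "regulatory"
    return "unannotated"


def tally(variants, coding, regulatory):
    """Count how many variants fall into each annotation class."""
    return Counter(classify(v, coding, regulatory) for v in variants)


if __name__ == "__main__":
    for label, count in sorted(tally(VARIANTS, CODING, REGULATORY).items()):
        print(label, count)
```

With these invented numbers, most of the variants land in regulatory intervals rather than coding ones. Real analyses work with millions of annotated sites and use interval indexes rather than the linear scan shown here.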

“It’s important to realize that these findings won’t be taken forward by people like Mark or myself—rather we have to empower clinical researchers to use this data,” said Birney of the EBI. “I think ENCODE will have a big impact on medical research—in particular, genome-wide association studies have a really remarkable overlap with ENCODE data outside of protein-coding genes, and this is leading to all sorts of new hypotheses of how these diseases operate.”

Colleen Shaddox is a freelance writer in Hamden, Conn.