When the Human Genome Project wrapped up in 2003, the world got its first complete instruction manual on how to build a human. The only problem? No one knew how to read most of it.

That’s because just a tiny fraction of the 3 billion letters in the manual form words that can be easily interpreted. Scientists have long relegated the rest, for the most part, to the trash heap.

But the results of a new study published in the June 14 issue of Nature reveal that there’s a lot more going on in the vast, uncharted regions of the genome than previously supposed, and suggest that so-called “junk DNA” may not be junk after all. Now the challenge is to figure out what all that DNA is for. Doing so may prove crucial for understanding complex human diseases.

“It’s sort of like Lewis and Clark,” says Michael Snyder, Ph.D., Lewis B. Cullman Professor of Molecular, Cellular and Developmental Biology and of molecular biophysics and biochemistry. “We’re trying to map out what’s there.”

Toward that end, Snyder’s lab is part of the Encyclopedia of DNA Elements (ENCODE) project, a mammoth undertaking of the National Human Genome Research Institute (NHGRI; part of the National Institutes of Health) involving a consortium of 35 groups of researchers at 80 institutions in 11 nations. These researchers have spent the last four years sifting through more than 400 million data points to make sense of just 1 percent of the human genome. That may not sound like much, but scrutinizing even this small portion of our DNA has turned up some surprises.

For one thing, the genome hosts a lot more activity than anyone expected. For over four decades, the canonical view has been that the important bits of our DNA (the readily decipherable genes making up 1.5 percent of the genome) get converted into RNA via a process called transcription. RNA, in turn, instructs the cell to make proteins, the molecular movers and shakers that do all the heavy lifting in the cell. Scientists have long assumed that, in general, each gene is transcribed into one RNA fragment and that the gene-free portions of our DNA, a whopping 98.5 percent of the genome, aren’t transcribed at all.

Not so, according to results from the ENCODE pilot project. They show that most of the letters in the genomic instruction manual wind up being transcribed. This happens, in part, because each gene is often transcribed along with a surprisingly large number of non-protein-coding sequences to produce some extraordinarily long RNA fragments. The results also indicate that a single gene can be transcribed into many different RNA fragments of varying lengths, such that each gene is represented, on average, by more than five transcripts that share overlapping sequences.

“That’s a lot more than we thought there would be,” says Snyder, adding that it’s unclear what all these extra transcripts are for.

Even more perplexing is the prevalence of RNA molecules transcribed entirely from gene-free portions of the genome. Non-protein-coding (NPC) RNA transcripts were previously known to exist, but the ENCODE project identified many new ones. Again, their purpose is unknown.

As intriguing as these findings are, Snyder is even more excited that the project has identified new regulatory regions: portions of the genome that do not encode proteins but instead control when, where and how strongly genes are expressed. A slew of recent studies have linked complex diseases with variations in NPC regions of the human genome that could have regulatory functions.

In May, for example, two independent groups linked heart disease with DNA variations in NPC portions of chromosome 9, and in April, a diabetes-linked variation was found in the same NPC region, along with six other variations found elsewhere in the genome. Might some of these variations in NPC DNA promote disease by interfering with the expression of genes at distal sites?

Snyder and his collaborators hope to answer questions like this by mapping out where all the regulatory regions are and how they work. “To me,” he says, “this is really what the ENCODE project is all about.”

The next step is for scientists to go from looking at just 1 percent of the genome to analyzing the whole thing. As daunting as that sounds, Elise A. Feingold, Ph.D., the program director for ENCODE at NHGRI, believes there’s good reason to be optimistic.

“The pilot project was a success,” says Feingold, who earned one of the first Ph.D. degrees from the School of Medicine’s Department of Genetics (then Human Genetics) in 1986. “It shows we can do this.”

And because cheaper, faster technologies keep popping up, Feingold believes it won’t be prohibitively expensive or time-consuming to do. NIH has set aside $100 million to fund the scaled-up project over the next four years. This might not be enough to fully analyze the whole genome, Feingold concedes, but “we’ll get a good way there.”