Harnessing the Power of Biomedical Data While Protecting Privacy

March 14, 2025
by Siddhant Pusdekar

With millions of people’s biomedical data at hand, solutions to long-standing medical puzzles are tantalizingly close. Such large amounts of data are enabling researchers to learn more about specific diseases—even very rare ones—as well as how diseases present differently across individuals, which could inform precision medicine efforts.

But for all their promise, these data come with the responsibility to protect people's privacy.

Biomedical data are stored in different repositories and managed by different groups. Combining data may allow for stronger analyses and could uncover new insights, but pooling separate datasets would both undermine the agreements made with those who shared their data and put their privacy at risk. Researchers, therefore, must navigate the challenges of unlocking the true potential of all this information while upholding data security.

Hyunghoon (Hoon) Cho, PhD, assistant professor of biomedical informatics and data science at Yale School of Medicine, is on the case. Cho’s team harnesses cryptographic, computational, and biomedical knowledge to create faster and more accurate research tools without compromising privacy.

We spoke with him about biomedical data security risks, working across data repositories to gain insight into health conditions, and his recent study on the topic published in Nature Genetics.

What attracted you to the field of biomedical data security?

It started out as a fun collaboration with my friend, David Wu, now a cryptographer [a security expert specializing in data encryption] at the University of Texas at Austin. I had some experience in developing computational tools for biomedical research and we wanted to combine our knowledge to work on data security. As I was working on that project, I realized there's an entire set of unresolved problems around security and biomedical data where I could make a meaningful contribution.

What is your research focus?

The two main prongs in my privacy work are to better understand the risks associated with different types of biomedical data, and to develop secure methods for working with these sensitive data.

There are different types of repositories for biomedical data collected by government agencies and research institutions. For example, biobanks contain genomic, electronic health, and some survey response data from individuals. There are other datasets linked to studies on specific health conditions. But these repositories are isolated from each other because of security concerns. Working across them is currently very difficult. We want to support any kind of biomedical research across these boundaries.

What sorts of risks come into play when managing this type of data?

One risk is re-identification through data linkage. In an environment where study participant data are publicly available, someone could link an individual’s data across repositories to figure out who the real person is behind a data point and then infer all this health-related information about them.

Why should people care about the privacy of their biomedical data?

Your personal health information might reveal diseases, genetic predispositions, or even family medical histories. Some people might not have an issue with their medical record being released, but the key to privacy is being able to say when you do or do not want your information to be public.

From a researcher’s perspective, working around siloed repositories sounds very inconvenient. Why is it important to do so anyway instead of simply putting the data all together?

I think that researchers and biomedical practitioners have a responsibility to protect subjects from potential harm and to honor the agreements the subjects made with a specific repository. More often than not, it is simply not possible to put all the data together in one place due to these constraints.

Finding ways to combine data in this restrictive environment is essential for research progress. Securing data would, hopefully, encourage trust and, therefore, wider participation in research, expanding the pools of data we can learn from.

In your recent Nature Genetics paper, you developed a system to work across repositories. What's new about it?

This was a collaboration with Bonnie Berger at MIT and Jean-Pierre Hubaux at École Polytechnique Fédérale de Lausanne.

We adapted two well-known techniques from cryptography. One is homomorphic encryption, which allows us to perform mathematical computations directly on encrypted data. The other is secure multi-party computation, a method that lets a group of parties jointly compute on their combined data without revealing their individual data to one another.
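[To make these two ideas concrete, here is a minimal sketch in Python. It is an illustration only and not the system described in the study: the secure-sum part uses additive secret sharing, one simple form of secure multi-party computation, and the encryption part uses a toy version of the Paillier scheme, a classic additively homomorphic cryptosystem swapped in here for brevity. The modulus, the small primes, the function names, and the counts are all made-up assumptions, and the tiny key size offers no real security.]

```python
import math
import secrets

# --- Secure multi-party computation (additive secret sharing) ---
MODULUS = 2**61 - 1  # hypothetical public modulus all parties agree on


def share(value, n_parties):
    """Split value into n random-looking shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate parties summing their inputs without revealing any single input."""
    n = len(private_values)
    # Each party deals one share of its private value to every party.
    dealt = [share(v, n) for v in private_values]
    # Party j adds up the shares it received; each partial sum looks random.
    partial = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Only the partial sums are published; together they reveal just the total.
    return sum(partial) % MODULUS


# --- Homomorphic encryption (toy Paillier, insecure key size) ---
P, Q = 1009, 1013                      # toy primes, assumptions for illustration
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                   # modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt an integer 0 <= m < N under the public key N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt with the private values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts (mod N)."""
    return (c1 * c2) % N2


if __name__ == "__main__":
    counts = [120, 45, 310]            # made-up per-repository counts
    print(secure_sum(counts))          # 475
    total = add_encrypted(encrypt(120), encrypt(45))
    print(decrypt(total))              # 165
```

[In both halves, the parties only ever see random-looking numbers or ciphertexts, yet the correct total comes out at the end. Practical systems rely on vetted cryptographic libraries and far larger parameters.]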

We combined these methods to perform a very common analysis in biomedicine called a genome-wide association study (GWAS). It essentially looks for genetic variants or mutations in the human genome that are linked to certain health conditions. Being able to find those patterns requires a large dataset that includes many individuals. We showed that it is possible to simultaneously analyze data from six repositories with a total of 410,000 individuals.
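[As a rough illustration of what "linked to" means here, the sketch below runs a textbook allelic chi-square test for a single variant, using allele counts pooled across repositories. This is a deliberately simple stand-in, not the statistical model used in the study, and it pools the counts in the clear rather than under encryption. The function names and all numbers are made up for the example.]

```python
def pool(tables):
    """Add up per-repository count tables element by element."""
    return tuple(sum(column) for column in zip(*tables))


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no test is possible
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Made-up counts per repository:
    # (case_alt, case_ref, control_alt, control_ref)
    repositories = [
        (12, 28, 8, 32),
        (9, 11, 5, 15),
        (20, 40, 12, 48),
    ]
    pooled = pool(repositories)
    print(pooled, allelic_chi_square(*pooled))
```

[A larger statistic means the variant's frequency differs more between people with and without the condition than chance alone would suggest. Small repositories may each be inconclusive on their own, which is why pooling matters.]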

We also did this in a much shorter timeframe. The previous system, which we also developed, took months to years to do what we can now achieve in a matter of days. In addition, our study expands the range of supported GWAS analyses to include the most common approaches taken by researchers.

Working across repositories like this could be useful for studying rare diseases and demographic groups that might be underrepresented in any one repository, but more sizeable across several of them.

What are some of the major challenges you face in your research?

The cryptographic tools I mentioned have been around for about 40 years, but when you try to apply them to a specific problem in biomedicine they often require impractical runtimes stretching for months. The core challenge for us is having a deep understanding of both the underlying computational and cryptographic techniques and the target applications, such as GWAS. That interdisciplinary approach allows us to make faster, more secure, and more usable tools.

The other challenge is conveying the strengths and limitations of these tools to people who are not security experts. We like developing these algorithms and we want people to use them. So, we've been collaborating with biomedical researchers, trying to help them run cross-biobank studies with our tools.

Are there any big questions you’re looking to tackle in the future?

One of the things I find exciting is leveraging all this information to create predictive models about individual patients for personalized health care. I would like to support that kind of research and the translation of these tools into clinical settings with privacy guarantees.

I should also mention that emerging AI models are a major challenge for us in the privacy and security community. While they have the potential to transform biomedicine, they also pose serious privacy risks that we do not yet understand well. Developing cryptographic tools for training and getting predictions using these models is also challenging because the models are so large. These are issues we want to address in future research.

The research reported in this news article was supported by the National Institutes of Health (awards R01HG010959, DP5OD029574, and RM1HG011558). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.