The Privacy Tools Project
Social science data, including medical and public-health information, can be mined to improve people’s lives. But sharing such data while preserving privacy through traditional means such as redaction and de-identification has proven difficult. Salil Vadhan, Joseph professor of computer science and applied mathematics, is lead principal investigator on Harvard’s Privacy Tools Project, which is developing a tool that will make confidential data (sensitive economic information, information about political preferences, or diagnoses of disease, for example) useful to researchers while preserving the privacy of personal information. His co-investigators include Weatherhead University Professor Gary King, director of the Institute for Quantitative Social Science (IQSS); Latanya Sweeney, professor of government and technology in residence and director of the Data Privacy Lab within IQSS; and other senior personnel, including Berkman Klein Center for Internet & Society executive director Urs Gasser, McKay professor of computer science Stephen Chong, and associate professor of statistics Edoardo Airoldi. Given that there are now legal mandates for sharing data from federally funded research projects, as well as laws requiring that personal information be kept private and secure, Vadhan and his colleagues have been working on a mathematical method called differential privacy that could satisfy both demands.
Differential privacy, which was invented by incoming SEAS faculty member Cynthia Dwork and senior research fellow Kobbi Nissim, now a professor of computer science at Georgetown University, doesn’t just scrub personal information like name, address, and birthdate in order to hide individual identities (a process called de-identification that is sometimes easily, and notoriously, reversed; see “Exposed: The Erosion of Privacy in the Internet Era,” September-October 2009, page 38). Instead, the original data set is stored securely in a central location, with access mediated through an interface that allows researchers to gather useful population-level information without letting individual-level data leak out. The interface must enable “all or many of the kinds of statistical analyses typically done in social science,” Vadhan explains: “gathering summary statistics, performing regressions, and posing causal hypotheses: does one particular government intervention, for example, cause an increase in socioeconomic status?”
The differential privacy tool that the project is developing is a computational tour de force that achieves anonymity for individuals by introducing random noise into the way statistics about the data are computed. “The amount of noise is carefully calibrated to hide the contribution of each individual person, but still reveal larger effects. And so there is a tradeoff,” says Vadhan. “You get greater privacy protection the more noise you introduce.” Although the larger results are therefore not perfectly accurate, this is “not new to privacy protection,” he points out. Anyone doing statistics knows about tolerating error: “Whenever you derive statistics from a sample…you’re not getting an exact reflection of the population, you’re getting some statistical estimate.”
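The calibration Vadhan describes can be sketched with the standard Laplace mechanism, a textbook building block of differential privacy (this is an illustrative toy, not the project’s actual tool, and the function names are hypothetical): for a counting query, any one person changes the count by at most one, so noise drawn from a Laplace distribution with scale 1/ε masks each individual’s contribution while leaving large-scale totals visible.

```python
import random

def laplace_noise(scale):
    # The difference of two independent exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon):
    """Count records matching `predicate`, plus Laplace noise with scale
    1/epsilon. One person can change the count by at most 1 (sensitivity 1),
    so this calibration hides any single individual's contribution."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

This makes the tradeoff Vadhan mentions concrete: a smaller ε means a larger noise scale, hence stronger privacy protection but a less accurate estimate.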
The project is also building legal and policy tools—including robot lawyers, as some of Vadhan’s colleagues call them—to help researchers navigate the complex array of laws, regulations, and best practices involved in handling sensitive data. The Berkman Klein Center, with key input from senior researcher David O’Brien, an attorney, and fellow Alexandra Wood, plays an important role in this part of the project together with Sweeney, IQSS chief data science and technology officer Mercè Crosas, fellow Michael Bar-Sinai, data scientist Micah Altman of MIT, Chong, and three of Chong’s students: Ph.D. candidate Aaron Bembenek, Obasi Shaw ’17, and Kevin Wang ’18. Among the tools are automated interviews that help determine how data should be shared and stored, and the terms of its use. What are the relevant legal constraints on a particular type of data (student data is covered by one statute, health data by another, for example), and what are the relevant institutional policies? The answers allow the system not only to make recommendations for the appropriate handling of a particular data set as it is initially stored in a repository, but also to generate automatically a set of licenses authorizing other researchers to use the data later. The work, a collaboration among computer science, statistics, IQSS, and the Berkman Klein Center, is extremely important, because analysis of data in the social sciences, medicine, and public health can lead to tremendous societal benefits (see “Why Big Data Is a Big Deal,” March-April 2014, page 30).
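The automated-interview idea can be sketched as a rule lookup (an illustrative toy; the rule table and field names are hypothetical, though FERPA and HIPAA are the U.S. statutes that govern student and health records, respectively, as the example in the text alludes to): the interview classifies the data set, and the classification drives both the handling recommendation and the license terms.

```python
# Hypothetical rule table mapping a data category, as determined by the
# automated interview, to the governing statute and handling recommendation.
RULES = {
    "student": {"statute": "FERPA", "storage": "restricted, encrypted repository"},
    "health":  {"statute": "HIPAA", "storage": "restricted, encrypted repository"},
    "public":  {"statute": None,    "storage": "open repository"},
}

def recommend_handling(data_category):
    """Return the applicable statute and storage recommendation,
    raising an error for categories the interview cannot classify."""
    rule = RULES.get(data_category)
    if rule is None:
        raise ValueError(f"no handling rule for category {data_category!r}")
    return rule
```

A real system layers institutional policies and license generation on top of such rules, but the core pattern, interview answers in, handling terms out, is the same.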
The work has given Vadhan practical insight into the debate over the use of distributed versus centralized systems for enhancing privacy and security (see main text). On the one hand, advocates of distributed computing systems, like Yochai Benkler, emphasize the structural and legal vulnerabilities of storing vast quantities of sensitive data in a single location. On the other, engineers like Ben Adida have reported on the technical and practical hurdles facing smaller companies and open-source efforts that attempt to deliver security in distributed settings.
“Both Ben and Yochai have a lot of wisdom on this topic,” says Vadhan, a theoretical computer scientist who earned his doctorate at MIT after graduating from the College; his background is in what’s called ‘computational complexity,’ the study of the fundamental limits of computation. “There have been a lot of strides in making what’s called ‘secure multi-party computation’ more practical,” he says. “This is the part of cryptography theory that tells you how to take a centralized computation and turn it into a distributed one.”
For distributed systems, he continues, “some really amazing things are possible—in theory.” The Privacy Tools Project itself has focused on the centralized model to start with, for one simple reason: “it’s technically easier,” he says. “Modern cryptography tells us that, in principle, anything you can do with the data centralized, you can also do in a distributed setting where the data is not collected in one place. But Ben is right, that when you talk about bringing the theory to practice and actually making it work, both at a technical level and also at an institutional, organizational, and political level, getting one of these distributed systems to work out can be quite challenging.” A new privacy tools project led by Kobbi Nissim that started in April 2016 will focus on advancing the science of privacy-preserving computation on distributed data. “Hopefully, one day we’ll reach the point where more of these distributed systems become practical,” Vadhan adds. But this can happen only if the privacy of human research subjects can be assured first, so the data can be shared widely—and safely.
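The principle Vadhan cites, that a centralized computation can be turned into a distributed one, rests on techniques such as secret sharing. A minimal illustration using additive secret sharing (a toy sketch of one building block of secure multi-party computation, not the project’s code; names and parameters are hypothetical): each data holder splits its value into random shares, so no single party sees anything but noise, yet combining the parties’ totals reveals exactly the sum and nothing more.

```python
import random

PRIME = 2**61 - 1  # field modulus; any sufficiently large prime works

def share(secret, n_parties):
    """Split a value into n additive shares that sum to it mod PRIME.
    Any n-1 shares together are uniformly random and reveal nothing."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def secure_sum(secrets, n_parties=3):
    """Each data holder distributes one share to each party; each party
    locally adds the shares it received; publishing only the per-party
    totals reveals the overall sum, but no individual value."""
    party_totals = [0] * n_parties
    for s in secrets:
        for i, sh in enumerate(share(s, n_parties)):
            party_totals[i] = (party_totals[i] + sh) % PRIME
    return sum(party_totals) % PRIME
```

Real protocols must also handle multiplication, malicious parties, and network coordination, which is exactly the theory-to-practice gap Vadhan and Adida describe.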