What will it take to achieve acceptable privacy-accuracy combinations? I discuss this question in five parts. In Section 1, I review the technicalities of the privacy-accuracy trade-off problem. In Section 2, I introduce the two articles in this symposium that study real-world privacy-accuracy combinations. In Sections 3 and 4, I ask, respectively: What is acceptable accuracy? What is acceptable privacy? I consider these questions through the lens of the two articles. I conclude in Section 5, connecting the technical questions that practitioners study (from Sections 1 and 2) with the normative questions that only society as a whole can and should answer (from Sections 3 and 4). I argue that while great technical progress has been made, much more work is needed before we can talk of getting even close to an acceptable privacy-accuracy combination. I touch on some currently investigated ideas for moving forward, and end with a call to open and broaden the conversation, bringing the rest of society in.
Keywords: privacy-accuracy trade-off, privacy-accuracy combinations, differential privacy, census, acceptable privacy, acceptable accuracy
Better late than never. It is only quite recently that we have realized, to our horror, that the private information collected by the U.S. Census may not be as protected as we had believed it to be: we have learned that it is possible to use aggregate statistics published in the past by the U.S. Census Bureau to reconstruct much of the underlying sensitive microdata. The idea is quite simple. The Census Bureau publishes billions of statistics on only slightly more than 300 million people. All a sophisticated ‘reconstruction attacker’ has to do is solve a system of billions of equations to find 300 million unknowns—an easy task, in principle and under some assumptions, with today’s computing power.
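As a toy illustration of the reconstruction idea, the sketch below brute-forces every set of ages consistent with a handful of published statistics for one tiny block. The block, its three residents, and the four statistics are hypothetical, and exhaustive search stands in for the far larger equation-solving an actual attacker would need; it is not the procedure used in any real attack.

```python
from itertools import combinations_with_replacement

# Hypothetical published statistics for one block of three residents.
PUBLISHED = {"count": 3, "mean_age": 44, "median_age": 30, "min_age": 8}

def reconstruct(published, max_age=99):
    """Return every sorted age tuple consistent with the published statistics.

    Assumes an odd count, so the median is the middle element.
    """
    n = published["count"]
    target_sum = published["mean_age"] * n
    matches = []
    # combinations_with_replacement over a sorted range yields sorted tuples.
    for ages in combinations_with_replacement(range(max_age + 1), n):
        if (sum(ages) == target_sum
                and ages[n // 2] == published["median_age"]
                and ages[0] == published["min_age"]):
            matches.append(ages)
    return matches

solutions = reconstruct(PUBLISHED)
print(solutions)  # -> [(8, 30, 94)]
```

In this made-up case, four innocuous-looking aggregates pin down all three ages exactly.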
Enter differential privacy. The Census Bureau can use it to replace the underlying microdata with synthetic microdata. The synthetic data are created from the original data using randomness of a certain predetermined, publicly known, scale and features. By publishing only statistics computed on the synthetic microdata—or even, as a special case, publishing the entire synthetic data set—the Census Bureau can provide formally provable privacy protections. The more randomness used to create the synthetic microdata—and hence, the greater the accuracy loss—the stronger the privacy-protection guarantees.
Differential privacy quantifies the strength of its privacy protection with a parameter ε. Its interpretation, in three steps:
1. Take any possible census data set, that is, any data set that has the size and structure of the actual census microdata, but that is filled with any possible (potentially made-up) individual records.
2. Imagine any possible ‘neighboring’ data set in which a single individual’s record in said data set is changed.
3. Create synthetic microdata from the actual census microdata in such a way that if you applied the same (known) procedure to create synthetic data from any possible data set in step 1 and from any of its neighboring data sets in step 2, the probability of any published statistic based on either synthetic data set would not differ, across the two data sets, by more than a multiplicative factor of e^ε.
The intuition behind this privacy protection is as follows. No matter what true values the actual census data set contains (see step 1), the privacy of any individual participating in that census is protected by the guarantee that if we replaced that individual’s data with some other data (see step 2), the probability of any specific value in the published information would not change by more than an e^ε multiplicative factor. The privacy of an individual participating in the census lies, therefore, in that their individual data cannot affect the probability of any published outcome too much. How much is too much? More than by an e^ε multiple.
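The three steps can be condensed into a single inequality. The notation is introduced here for illustration only and is not used elsewhere in this article: M denotes the (randomized) procedure that produces the published output, D and D' are any two neighboring data sets, and S is any set of possible published outputs.

```latex
\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[M(D') \in S\bigr]
\qquad \text{for all neighboring } D, D' \text{ and all } S .
```

Because the roles of D and D' can be swapped, the same bound holds in both directions.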
For example, the probability that the published (synthetic) data set classifies N individuals as being below 18 years old is at most e^ε times higher when that published data set is created from the actual census data set than when it is created from a neighboring data set where one of the (actual) below-18 individuals is replaced with an 18-or-above individual.
If ε were taken, in the extreme, to be 0, then e^ε would be 1, and the privacy guarantee would be absolute: the probability that N individuals are classified as children in the published data set would need to be the same regardless of the underlying microdata. Of course, this extreme value of ε would make the published data set useless: it would contain no information regarding the actual underlying, private microdata. The larger ε is, the more useful, or accurate, is the published data set, and the weaker is the privacy guarantee.
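To make the e^ε bound concrete, here is a minimal sketch that uses the textbook Laplace mechanism on a single count as a stand-in; it is not the Census algorithm, and the count of 1,000 children and the grid of possible outputs are assumptions chosen for illustration. For each ε it finds the largest ratio between the output density under the actual count and under a neighboring count that differs by one child.

```python
import math

def log_density_ratio(x, count, neighbor_count, epsilon):
    """Log of (density of output x given count) / (density given neighbor_count),
    when Laplace noise with scale 1/epsilon is added to the true count."""
    return epsilon * (abs(x - neighbor_count) - abs(x - count))

def worst_case_factor(epsilon, count=1000):
    """Largest density ratio over a grid of possible published values."""
    neighbor_count = count - 1  # one child replaced by an adult
    outputs = range(count - 100, count + 101)
    worst = max(log_density_ratio(x, count, neighbor_count, epsilon) for x in outputs)
    return math.exp(worst)

for epsilon in (0.0, 0.1, 0.25, 1.0, 8.0):
    # The worst-case ratio matches e^epsilon; at epsilon = 0 it is exactly 1.
    print(epsilon, worst_case_factor(epsilon), math.exp(epsilon))
```

Working with log ratios avoids dividing two vanishingly small densities when ε is large.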
The holy grail in the Census differential privacy project is finding a value of ε that could allow for an acceptable combination: an acceptable level of privacy protection with an acceptable level of accuracy of the published data.
One of the great achievements of the Census differential privacy project is that we now can, for the first time, meaningfully discuss such an acceptable privacy-accuracy combination. Before the census turned to differential privacy, there was no way to formally quantify the privacy provided by its disclosure-avoidance techniques; certainly not by those of us not privy to those (unpublished) techniques—but not even, as it turns out, by our colleagues at the Census who designed and implemented these techniques (see my opening paragraph above).
Now that differential privacy allows us to formulate, investigate, and understand the privacy-accuracy trade-off problem, we can finally ask: Can we achieve an acceptable combination?
Answering this question involves two steps. First, it requires understanding the accuracy-privacy combinations that are technically achievable, given current or future technology. Second, it requires answering two normative questions: What is acceptable accuracy? What is acceptable privacy?
To study what is technically achievable, we need exactly the type of studies by Asquith et al. (2022) and Brummet et al. (2022). Of course, as these authors observe, their specific findings are by now somewhat outdated, since Census is rapidly updating its algorithm, responding quickly to user accuracy concerns by rebalancing the ways that it uses noise, to ensure high accuracy where it matters most. But the studies are useful and informative for examining the questions and considerations that must go into an analysis of the trade-off.
Using the 1940 Census—the most recent census for which the microdata are publicly available—the authors of these two studies perturb it using an early version of the Census differential privacy algorithm to achieve a range of ε levels, and use repeated runs of the perturbation to examine the distributions of several statistics of interest. For each ε level, they thus allow us to examine the accuracy loss, that is, the distribution of differences in these to-be-published statistics relative to the same statistics when calculated from the unperturbed 1940 Census data.
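The repeated-runs design can be sketched in a few lines. The example below is an assumption-laden miniature: it perturbs one hypothetical count of 250 people with Laplace noise of scale 1/ε rather than running the Census algorithm on the 1940 microdata, and the function names are illustrative. What it shares with the studies is the logic of rerunning the perturbation many times at each ε and summarizing the distribution of errors.

```python
import random
import statistics

EPSILONS = (0.25, 0.5, 0.75, 1, 2, 4, 6, 8)  # the levels examined in the studies

def laplace_noise(rng, epsilon):
    # The difference of two exponentials with rate epsilon is Laplace with scale 1/epsilon.
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)

def error_distribution(true_count, epsilon, runs=2000, seed=0):
    """Errors (published minus true) across repeated perturbation runs."""
    rng = random.Random(seed)
    published = [true_count + laplace_noise(rng, epsilon) for _ in range(runs)]
    return [value - true_count for value in published]

def summarize(errors):
    abs_errors = sorted(abs(e) for e in errors)
    return {
        "median_abs_error": statistics.median(abs_errors),
        "p90_abs_error": abs_errors[int(0.9 * len(abs_errors))],
    }

summaries = {eps: summarize(error_distribution(250, eps)) for eps in EPSILONS}
for eps, summary in summaries.items():
    print(eps, summary)
```

Each row is one point on an accuracy-privacy frontier for this toy statistic; the studies build analogous summaries for real statistics of interest.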
It is important to bear in mind that since the algorithms used by the Census are in constant flux, any specific empirical study at best illustrates the trade-offs that can currently be guaranteed, rather than the trade-offs that may be feasible as technology improves. As such, current findings of specific trade-offs should be interpreted as lower, rather than upper, bounds. That said, as real-world applications of differential privacy, these two studies put concrete numbers on a possible privacy-accuracy trade-off, numbers that we can now discuss. Since at this point differential privacy is the only game in town with provable guarantees (and a promising game, at that), the more such studies we see now, the better. And given the fast pace at which things move at the Census, the speed at which these and follow-up studies are completed is important too, and should be prioritized.
What do these studies find? What accuracy-privacy combinations are technically achievable using these studies’ procedures and data? I give examples below, in the context of the two normative questions above, without which we cannot interpret the findings.
What constitutes acceptable accuracy is a question that must be discussed not only by those who will use census data; importantly, it must be discussed also by those who will be affected by census data—that is, all stakeholders (or their representatives) in U.S. society. The answers to this question are application-specific. The introduction of the differential privacy toolkit in the census context opens the door for this long-overdue discussion, since it enables—for the first time—processing steps on census data that come with provable, transparent bounds on their impact on statistical accuracy.
Of course, even ‘unperturbed’ census data, before any disclosure-limitation techniques are applied, are themselves noisy due to myriad factors such as nonresponse bias, and imputations and corrections applied to the data. Any conversation about acceptable accuracy is therefore incomplete without accounting for these additional sources of error. In particular, a discussion about the impact of differential privacy on accuracy is incomplete without examining how the scale of the perturbations from privacy compares with the scale of other sources of noise, and how the various sources of noise interact: Do they amplify one another? Do they sometimes cancel one another out?
That said, for now, the two studies let us assess the additional accuracy loss at different perturbation levels, given the Census algorithm used at the time the studies were conducted, in the context of specific realistic uses of the 1940 Census data. Does the accuracy loss they find seem acceptable to these studies’ authors?
Start with Brummet et al. (2022). They investigate three applications of census data: two related to survey sampling, and one that simulates allocating funds to specific areas. Consider their fund-allocation application:
Assume a budget of $5 billion to be allocated nationwide proportional to the number of individuals <18 years old (roughly $125/child). What is the distribution of misallocated funds across enumeration districts and counties?
Brummet et al. (2022) answer this question for eight different levels of ε: 0.25, 0.50, 0.75, 1, 2, 4, 6, and 8. The team finds that across counties (generally large geographic units) per-child misallocation is modest (see their Table 8, Panel A). A county at the 10th percentile would mistakenly receive around $1.30–3.30 per child less than it should, while a county at the 90th percentile would receive $0.60–3.10 per child more than it should, for ε in the above range of 0.25 (largest misallocations) to 8 (smallest misallocations). The authors observe that even such modest misallocations “may still be large enough to cause concerns for districts that depend on the funds.” Moreover, for districts (much smaller geographical units) misallocation can be quite substantial (Table 8, Panel B). While it remains less than $10 per child for ε > 1, it ranges from $15 for ε = 1 to $57 for ε = 0.25, almost half of the original per-child allocation. The authors conclude that “the noise injection can lead … to substantial misallocations of funds to local areas.”
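The mechanics of per-child misallocation can be shown with a small worked example. Everything in it is hypothetical: the three areas, their child counts, and the hand-picked ‘noisy’ counts are invented for illustration and are not taken from Brummet et al. (2022) or produced by any differential privacy algorithm. The budget is set so that the true allocation is $125 per child, matching the scenario above.

```python
def allocate(budget, counts):
    """Split a budget across areas in proportion to their child counts."""
    total = sum(counts)
    return [budget * count / total for count in counts]

def per_child_misallocation(budget, true_counts, noisy_counts):
    """Dollars per true child that each area gains (+) or loses (-) because
    funds are allocated using noisy rather than true counts."""
    true_alloc = allocate(budget, true_counts)
    noisy_alloc = allocate(budget, noisy_counts)
    return [
        (noisy - true) / children
        for noisy, true, children in zip(noisy_alloc, true_alloc, true_counts)
    ]

true_counts = [40, 400, 4000]      # hypothetical small, medium, and large areas
noisy_counts = [44, 398, 3998]     # hypothetical perturbed counts (errors of +4, -2, -2)
budget = 125 * sum(true_counts)    # $125 per child overall

result = per_child_misallocation(budget, true_counts, noisy_counts)
print(result)  # -> [12.5, -0.625, -0.0625]
```

Count errors of similar absolute size translate into a per-child error that is far larger in the smallest area, which is the pattern the county-versus-district comparison above reflects.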
Asquith et al. (2022) explore other applications. They look at population counts (total, White, and African American) across counties and other geographies, and at three commonly used segregation indices across counties. They too investigate the impact of the Census differential privacy algorithm, known as the Disclosure Avoidance System (DAS), on accuracy for the same range of ε values, though focusing on the subset 0.25, 1, and 8. In addition, they explore different sub-allocations of a given ε across geographic levels and queries. The authors report a rich set of findings, interspersed with suggestions regarding what in their view constitutes acceptable accuracy for particular applications—a welcome discussion that we need more of, in a broader context.
For example, looking at total-population counts across counties at the strongest privacy level they consider, ε = 0.25, “population discrepancies become common, and many counties have differences of 5% or more.” Discrepancies shrink as ε goes up or when limiting the analysis to the largest counties, but grow dramatically when replacing total population with African American counts: “For this group, population estimates vary considerably for a majority of counties, even when ε = 8 and, outside the Deep South, even for counties with above-median population.” Moving from simple counts to more complex statistics such as segregation measures, the authors conclude that they “may become entirely impractical with data that have had the DAS applied, especially in rural places or when studying low-population groups.” The authors call for the development of new, more robust, statistics, but point to a variety of issues (e.g., backward compatibility).
As mentioned, these findings raise the question of how the scale of noise from these levels of differential privacy compares with the scale of disagreement between different data sources, which in turn highlights the importance of deeper investigation of these other sources of noise. If the private census data already have unacceptable levels of noise for certain statistics, then we may judge our data accuracy unacceptable even before the additional accuracy lost due to differential privacy. It would be misleading then to place all the blame at the feet of privacy—if only implicitly, by focusing on the part of the accuracy loss that is due to privacy. Moreover, Asquith et al. (2022) mention recent research suggesting that postprocessing by the Census may in fact be responsible for most of the discrepancies between the unperturbed data and the Census-released differentially private versions—again taking blame away from privacy itself. All that said, however, the two studies do find substantial reason for concern regarding accuracy in their simulations.
As discussed, these two studies focus on evaluating accuracy for ε values in the range 0.25–8. While readily admitting that interpretation of the privacy protection at the higher end of this range is difficult, Brummet et al. (2022) explicitly refer to the lower end as “strong privacy protection.” But is it? Do ε values in any part of this range provide meaningful levels of privacy?
The theoretical computer science literature, where the differential privacy apparatus has been, and is still being, developed, is mostly silent on the normative question regarding acceptable levels of ε. Indeed, what constitutes acceptable privacy is also a question that can only be answered by society at large. However, in many computer science papers and talks, ε = 0.1 appears to be the example of choice. It therefore appears that ε = 0.1 has emerged as a tacitly agreed-upon privacy guarantee that provides minimally meaningful privacy protection.
Following that literature, work that discusses the use of differential privacy in the social sciences (see, e.g., Heffetz & Ligett, 2014, and Oberski & Kreuter, 2020) often uses ε = 0.1 as a standard example in concrete calculations. Yielding a multiplicative constant of e^0.1 ≈ 1.11, this de facto privacy standard implies, in the Brummet et al. (2022) example above, that the probability the perturbed 1940 Census classifies N individuals as < 18 years old is at most only 11% higher for the actual data set than for a neighboring one where one child is switched to an adult. Of course, ‘only’ is a value judgment; but it seems a defensible one.
In comparison, the levels of ε considered by the two studies above are much higher. The lowest level of ε in the studies is ε = 0.25, yielding a multiplicative constant of e^0.25 ≈ 1.28 (so the probability for N observations classified as children in the published data set is at most 28% higher for the original than for the neighboring data set). The highest level of ε in the studies, ε = 8, yields an e^8 ≈ 2,981 multiplicative constant, making the upper bound on the privacy guarantee all but meaningless. Overall, the range of ε levels considered by these articles provides formal upper-bound privacy guarantees that do not seem to promise very meaningful privacy protection.
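The multiplicative constants quoted above are simple to recompute. The short sketch below does so for the de facto benchmark and for the lowest and highest ε in the two studies; the function name is illustrative.

```python
import math

def multiplicative_factor(epsilon):
    """Upper bound e^epsilon on how much one record can shift any output probability."""
    return math.exp(epsilon)

for epsilon in (0.1, 0.25, 8):
    factor = multiplicative_factor(epsilon)
    percent_higher = (factor - 1) * 100
    print(f"epsilon = {epsilon}: factor {factor:,.2f} (at most {percent_higher:,.0f}% higher)")
```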
In practice, things may be much worse. Individuals are likely to participate in several censuses throughout their lives. Unless we develop an implementation of the technology to leverage this fact and provide better-than-expected privacy guarantees across several censuses, or unless we develop a more nuanced language for reasoning about how census-related privacy harms accumulate across an individual’s lifetime, the reasonable effective ε budget for a single census is much lower than 0.1. Given life expectancy in the United States, the expectation nowadays is for around eight censuses in one’s lifetime. To guarantee a lifetime ε = 0.1 by perturbing each census in isolation (as these studies do), each single census would have to guarantee ε = 0.1/8 = 0.0125, that is, 20 times lower than even the lowest ε (= 0.25), and 640 times lower than the highest ε (= 8), considered in the two articles.
Stated another way, the strongest privacy guarantee considered in these two studies, ε = 0.25, adds throughout a participant’s lifetime to around ε = 2, or e^2 ≈ 7.4. A guarantee that, throughout their life, census participants’ published age, sex, or race are ‘only’ 7.4 times more likely given they are the true underlying values than given the alternative, is not a strong privacy guarantee.
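The lifetime arithmetic can be written out explicitly. The sketch below assumes the simplest accounting, in which the ε values of independently perturbed censuses add up, and takes the figure of eight censuses per lifetime from the text; the function names are illustrative.

```python
import math

CENSUSES_PER_LIFETIME = 8  # figure used in the text

def per_census_budget(lifetime_epsilon, n_censuses=CENSUSES_PER_LIFETIME):
    """Per-census epsilon needed so that n censuses sum to the lifetime target."""
    return lifetime_epsilon / n_censuses

def lifetime_epsilon(per_census_epsilon, n_censuses=CENSUSES_PER_LIFETIME):
    """Total epsilon when each census is perturbed in isolation and the losses add."""
    return per_census_epsilon * n_censuses

budget = per_census_budget(0.1)
print(budget)                            # per-census epsilon for a lifetime epsilon of 0.1
print(0.25 / budget, 8 / budget)         # how far the studies' lowest and highest epsilon exceed it
print(lifetime_epsilon(0.25))            # lifetime epsilon at the strongest level studied
print(math.exp(lifetime_epsilon(0.25)))  # corresponding multiplicative factor
```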
Of course, these back-of-the-envelope calculations are only illustrative. The Census Bureau is actually using a refined version of differential privacy to reason about the way in which privacy losses add up across computations. As a result, the cumulative harm from participation in multiple censuses grows more slowly than in the simple illustrations above. But the overall qualitative point holds.
In summary, even if the decennial census is the only survey one ever participates in, given the differential privacy technologies these articles currently implement and the constraints they have to obey, the range of ε they investigate is not even close to what the differential privacy literature itself would consider as acceptable privacy guarantees. Indeed, the range investigated is at least one to two orders of magnitude higher than a reasonable range, even under very favorable assumptions.
These studies demonstrate that we can finally have a meaningful, long-overdue discussion of the accuracy-privacy trade-off. But they also highlight that the preliminary Census algorithm they used was probably ‘not there yet’ in terms of acceptable privacy-accuracy trade-offs. This raises two points: ‘getting there’ is urgent, but so is setting up the societal infrastructure to evaluate whether and when algorithmic techniques are able to provide an acceptable trade-off.
There are many possible paths to ‘getting there’: better algorithms; more nuanced understanding of accuracy requirements; relaxing insistence that differentially private synthetic data obey certain ‘invariants’ that—perhaps unnecessarily—tie the algorithm’s hands; better reasoning about how privacy ‘harms add up’; exploring relaxed privacy notions; and more. We need more analysis of which of these approaches are most important to ensuring that society has meaningful privacy-accuracy combinations to select from.
To place the question of acceptable trade-offs in relief, it is perhaps informative to examine the alternative: Is society currently better off with a corner solution? Under which conditions, if any, would it be better to publish the original, unperturbed data set—with whatever other sources of noise it contains anyway—and explicitly guarantee that original level of accuracy but no privacy regarding these demographic variables?
As mentioned in my introduction, this no-privacy corner solution is approximately what has historically happened unintentionally anyway, through the publication of billions of statistics on roughly 300 million people. It is a great achievement of the Census differential privacy project that we now have the tools to consider this corner solution in a fully informed fashion, rather than as an uninformed mistake. Also, notice that the other corner solution—not using the data at all, guaranteeing full privacy—is not a viable option. Even ‘completely hiding’ the data from the public but using it as the basis for policy decisions does not provide meaningful privacy, since the data are potentially vulnerable to a reconstruction attack, on the basis of the observed decisions, which in themselves reveal something about the data.
If a no-privacy corner solution is unacceptable to society, investigating ways to get to an acceptable interior accuracy-privacy combination should be a high research priority. One avenue is going back to the question: Does the differential privacy definition used in these articles provide a privacy guarantee that is too strong?
To a social scientist, the worst-case guarantee that differential privacy provides may feel unnecessarily strong, as it protects against what are potentially extremely low-probability events. Recall that the differential privacy guarantee covers any possible combination of microdata in the data set, and any possible individual record change in creating its neighboring data sets.
To a computer scientist, however, a rare weakness is still a weakness to worry about. Intuitively, with less than the worst-case guarantee that differential privacy provides, an attacker would be able to find vulnerabilities to exploit. In addition, privacy is, after all, meant to protect the weak and vulnerable, rather than the typical; it should, in that sense, be designed to protect the outliers, rather than the norm. That said, Census and industry have been carrying out promising work on careful relaxations of differential privacy. These relaxations still protect the outliers, but allow for a failure of the guarantee in only unthinkably unlikely scenarios.
The insistence on worst-case guarantees (which, from a privacy point of view, may be fully justified) creates an asymmetry in studies of the privacy-accuracy trade-off. Brummet et al. (2022) and Asquith et al. (2022), and the differential-privacy-based approach more generally, take as a given a certain level of ε, that is, a certain worst-case privacy guarantee, and investigate the distribution, and with it the accuracy, of resulting statistics. Could things be somehow flipped? Could we think of an accuracy requirement, for example, a guarantee that no more than a threshold level (or percent) of funds are misallocated across counties or districts, and investigate the distribution of privacy guarantees that it entails?
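One very simplified way to picture the flipped question: fix an accuracy requirement first and solve for the ε it forces. The sketch below again uses a single count perturbed with Laplace noise of scale 1/ε as a stand-in, for which the probability that the error exceeds a threshold t is exp(-εt). It is an illustration under that assumption only; it is not the Census algorithm, and it does not reproduce the method of any work cited in this article. The function names are hypothetical.

```python
import math

def failure_probability(max_error, epsilon):
    """Probability that Laplace noise with scale 1/epsilon exceeds max_error in absolute value."""
    return math.exp(-epsilon * max_error)

def epsilon_for_accuracy(max_error, failure_prob):
    """Smallest epsilon for which the count is off by more than max_error
    with probability at most failure_prob."""
    if max_error <= 0 or not 0 < failure_prob < 1:
        raise ValueError("need max_error > 0 and 0 < failure_prob < 1")
    return math.log(1.0 / failure_prob) / max_error

for max_error in (1, 5, 25):
    eps = epsilon_for_accuracy(max_error, failure_prob=0.05)
    print(f"error at most {max_error} with 95% probability requires epsilon >= {eps:.3f}")
```

Tighter accuracy requirements translate directly into larger ε, that is, weaker privacy guarantees, which is the trade-off viewed from the accuracy side.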
More generally, we need new formalism to allow us to perform the necessary cost-benefit analyses, incorporating both the probabilities and the severities of various potential accuracy and privacy harms and losses. I would like to see these issues discussed from philosophical, ethical, social, political, legal, and economic perspectives. We cannot move forward without this conversation. And in this conversation, we need to separate the technical-limitations discussion from the normative decisions that society must make.
Differential privacy has made this conversation possible. Now let us bring the rest of society in.
Ori Heffetz has no financial or non-financial disclosures to share for this article.
Abowd, J. M., & Schmutte, I. M. (2019). An economic analysis of privacy protection and statistical accuracy as social choices. American Economic Review, 109(1), 171–202. https://doi.org/10.1257/aer.20170627
Asquith, B. J., Hershbein, B., Kugler, T., Reed, S., Ruggles, S., Schroeder, J., Yesiltepe, S., & Van Riper, D. (2022). Assessing the impact of differential privacy on measures of population and racial residential segregation. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.5cd8024e
Brummet, Q., Mulrow, E., & Wolter, K. (2022). The effect of differentially private noise injection on sampling efficiency and funding allocations: Evidence from the 1940 Census. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.a93d96fa
Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2011). Differential privacy: A primer for the perplexed. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, WP. 26. October 26–28, Tarragona, Spain. https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2011/26_Dwork-Smith.pdf
Heffetz, O., & Ligett, K. (2014). Privacy and data-based research. Journal of Economic Perspectives, 28(2), 75–98. https://doi.org/10.1257/jep.28.2.75
Oberski, D. L., & Kreuter, F. (2020). Differential privacy and social science: An urgent puzzle. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.63a22079
Wu, S., Roth, A., Ligett, K., Waggoner, B., & Neel, S. (2019). Accuracy first: Selecting a differential privacy level for accuracy-constrained ERM. Journal of Privacy and Confidentiality, 9(2). https://doi.org/10.29012/jpc.682
©2022 Ori Heffetz. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.