In a technical treatment, this article establishes the necessity of transparent privacy for drawing unbiased statistical inference for a wide range of scientific questions. Transparency is a distinct feature enjoyed by differential privacy: the probabilistic mechanism with which the data are privatized can be made public without sabotaging the privacy guarantee. Uncertainty due to transparent privacy may be conceived as a dynamic and controllable component from the total survey error perspective. As the 2020 U.S. Decennial Census adopts differential privacy, constraints imposed on the privatized data products through optimization constitute a threat to transparency and result in limited statistical usability. Transparent privacy presents a viable path toward principled inference from privatized data releases, and shows great promise toward improved reproducibility, accountability, and public trust in modern data curation.
Keywords: statistical inference, unbiasedness, uncertainty quantification, total survey error, privacy-utility trade-off, invariants
When conducting statistical analysis using privacy-protected data, the transparency of the privacy mechanism is a crucial ingredient for trustworthy inferential conclusions. The knowledge about the privacy mechanism enables accurate uncertainty quantification and ensures high statistical usability of the data product. This article discusses the key statistical considerations behind transparent privacy, which leads to improved reproducibility, accountability, and public trust. It weighs a few challenges to transparency that emerge from the adoption of differential privacy by the 2020 U.S. Decennial Census.
The Decennial Census of the United States is a comprehensive tabulation of its residents. For over two centuries, the census data supplied benchmark information about the states and the country, helped guide policy decisions, and provided crucial data in many branches of the demographic, social, and political sciences. The census aims to truthfully and accurately document the presence of every individual in the United States. The fine granularity of the database, compounded by its massive volume, portrays American life in great detail.
The U.S. Census Bureau is bound by Title 13 of the United States Code to protect the privacy of individuals and businesses who participate in its surveys. These surveys contain centralized and high-quality information about the respondents. If disseminated without care, they might pose a threat to the respondents’ privacy. The Bureau implements protective measures to reduce the risk of inadvertently disclosing confidential information. The first publicly available documentation of these methods dates back to 1970 (McKenna, 2018). Until the 2010 Census, statistical disclosure limitation (SDL) mechanisms deployed by the Census Bureau relied to a large extent on table suppression and data swapping, occasionally supplemented by imputation and partially synthetic data. These techniques restricted the verbatim release of confidential information through the data products. However, they do not offer an exposition of privacy protection as a goal in itself. What does the SDL mechanism aim to achieve, and how do we know whether it is actually working? The answers to these questions are not definitive. In particular, the extent of an SDL mechanism’s intrusiveness on data usability is not measured and weighed against the extent of privacy protection it affords. We now understand that many traditional SDL techniques are not just ambiguous in definition, but defective in effect, for they can be invalidated by carefully designed attacks that leverage modern computational advancements and auxiliary sources of open access information (see, e.g., Dinur & Nissim, 2003; Sweeney, 2002). With the aid of publicly available data, the Census Bureau attempted a ‘reidentification’ attack on its own published 2010 Census tabulations, and was successful in faithfully reconstructing as much as 17% of the U.S. population, or 52 million people at the level of individuals (Abowd, 2019; Hawes, 2020). These failures are a resounding rejection of the continued employment of traditional SDL methods. 
It is clear that we need alternative, and more reliable, privacy tools for the 2020 Census and beyond.
In pursuit of a modern paradigm for disclosure limitation, the Census Bureau endorsed differential privacy as the criterion to protect the public release of the 2020 Decennial Census data products. The Bureau openly engaged data users and sought constructive feedback when devising the new Disclosure Avoidance System (DAS). It launched a series of demonstration data product and codebase releases (U.S. Census Bureau, 2020a), and presented its design processes at numerous academic and professional society meetings, including the Joint Statistical Meetings, the 2020 National Academies of Sciences, Engineering, and Medicine (NASEM) Committee on National Statistics (CNSTAT) Workshop, and the 2019 Harvard Data Science Institute Conference in which I participated as a discussant. Reactions to this change from the academic data user communities were a passionate mix. Some cheered for the innovation, while others worried about the practical impact on the usability of differentially privatized releases. To keep up with the inquiries and criticisms, the Census Bureau assembled and published data-quality metrics that were assessed repeatedly as the design of the 2020 DAS iterated (U.S. Census Bureau, 2020b). Through the process, the Bureau exhibited an unprecedented level of transparency and openness in conveying the design and the production of the novel disclosure control mechanism, publicizing the description of the TopDown Algorithm (Abowd et al., 2022) and the GitHub code base (2020 Census DAS Development Team, 2021). This knowledge makes a world of difference for census data users who need to analyze the privatized data releases and assess the validity and the quality of their work.
This article argues that transparent privacy enables principled statistical inference from privatized data releases. If a privacy mechanism is known, it can be incorporated as an integral part of a statistical model. Any additional uncertainty that the mechanism injects into the data can be accounted for properly. This is the most reliable way to ensure the correctness of the inferential claims produced from privatized data releases, when a calculated loss of statistical efficiency is present. For this reason, the publication of the probabilistic design of the privacy mechanism is crucial to maintaining a high usability of the privatized data product.
Part of what contributed to the failure of the traditional disclosure limitation methods is that their justification appeals to intuition and obscurity, rather than explicit rules. If the released data are masked, coarsened, or perturbed from the confidential data, it seems natural to conclude that they are less informative, and consequently more ‘private.’ Traditional disclosure limitation mechanisms are obscure, in the sense that their design details are rarely released. For swapping-based methods, not only are the swap rates omitted, but the attributes that have been swapped are often not disclosed either (Oganian & Karr, 2006). As a consequence, an ordinary data user would not have the necessary information to replicate the mechanism, nor to assess its performance in protecting privacy. The effectiveness of obscure privacy mechanisms is difficult to quantify.
For data analysts who utilize data releases under traditional SDL to perform statistical tasks, the opaqueness of the privacy mechanism poses an additional threat to the validity of the resulting inference. A privacy mechanism, be it suppressive, perturbative, or otherwise, works by processing raw data and modifying their values to something that may be different from what has been observed. In doing so, the mechanism injects additional uncertainty in the released data, weakening the amount of statistical information contained in them. Uncertainty per se is not a problem; if anything, the discipline of statistics devotes itself to the study of uncertainty quantification. However, in order to properly attribute uncertainty where it is due, some minimal knowledge about its generative mechanism must be known. If the design of the privacy mechanism is kept opaque, our knowledge would be insufficient for producing reliable uncertainty estimates. The analyst might have no choice but to ignore the privacy mechanism imposed on the data, and might arrive at erroneous statistical conclusions.
Differential privacy conceptualizes privacy as the probabilistic knowledge to distinguish the identity of one individual respondent in the data set. The privacy guarantee is stated with respect to a random mechanism that imposes the privacy protection. Definition 1 presents the classic and most widely endorsed notion called
The positive quantity
As a formal approach to privacy, statistical disclosure limitation mechanisms compliant with differential privacy put forth two major advantages over their former counterparts. The first is provability, a mathematical formulation against which guarantees of privacy can be definitively verified as it is conceptualized. Definition 1 puts forth a concrete standard about whether, and by how much, any proposed mechanism can be deemed differentially private, as the probabilistic property of the mechanism is entirely encapsulated by
The second major advantage of differential privacy, which this article underscores, is transparency. Differential privacy allows for the full, public specification of the privacy mechanism without sabotaging the privacy guarantee. The data curator has the freedom to disseminate the design of the mechanism, allowing the data users to utilize it and to critique it, without compromising the effectiveness of the privacy protection. The concept of transparency that concerns this article will be made precise in Section 4. As an example, below is one of the earliest proposed mechanisms that satisfies differential privacy:
The omitted proportionality constant in (2.2) is equal to
We note that differentially private mechanisms compose their privacy losses nicely. At a basic level, two separately released differentially private data products, incurring PLBs of
The preliminary versions of the 2020 Census DAS utilize the integer counterpart to the Laplace mechanism, called the double geometric mechanism (Fioretto et al., 2021; Ghosh et al., 2012). The mechanism possesses the same additive form as the Laplace mechanism, but instead of real-valued noise
Data privatization constitutes a phase in data processing which succeeds data collection and precedes data release. When conducting statistical analysis on processed data, misleading answers await if the analyst ignores the phases of data processing and the consequences they impose.
We use an example of simple linear regression to illustrate how obscure privacy can be misleading. Regression models occupy a central role in many statistical analysis routines, for they can be thought of as a first-order approximation to any functional relationship between two or more quantities. Let
Suppose the analyst is supplied the privatized statistics
Unfortunately, no matter which computational procedure one uses, the point estimates obtained from fitting (3.3) are no longer unbiased or consistent for
and the naïve intercept estimator
which is strictly larger than
We use a small sample simulation study (
Figure 1. 95% joint confidence regions for
The troubling consequence of ignoring the privacy mechanism is not new to statisticians. The naïve regression analysis of privatized data generalizes a well-known scenario in the measurement error literature, called the classic measurement error model. The notable biasing effect on the naïve estimator
From the additive mechanism in Definition 2, we see that the noise term
In the universe of statistical analysis, nonlinear estimators are the rule, not the exception. Many descriptive and summary statistics involve nonlinear operations such as squaring or dividing (think variances, proportions, and other complex indices), which don’t fare well with additive noise. Ratio estimators, or estimators that involve random quantities in their denominators, can suffer from high variability if the randomness is high. Therefore, many important use cases of the census releases, as well as the assessment of the impact due to privacy, could benefit from additional uncertainty quantification. As an example, Asquith et al. (2022) evaluate a preliminary version of the 2020 Census DAS using a set of segregation metrics as the benchmark statistics and compare its effect when applied to the 1940 full-count census microdata. One of the evaluation metrics is the index of dissimilarity per county (Iceland et al., 2002):
If we were to repeatedly create privatized demonstration data sets from the DAS, and calculate the dissimilarity index each time by naïvely replacing all quantities in (3.6) with their privatized counterparts, we will witness variability in the value
Privacy adds an extra layer of uncertainty to the generative process of the published data, just as does any data-processing procedure such as cleaning, smoothing, or missing data imputation. We risk obtaining misguided inference whenever we blindly fit a favorite confidential data model on privatized data without acknowledging the privatization process, for the same reason we would be misguided by not accounting for the effect of data processing. To better understand the inferential implication of privacy and obtain utility-oriented assessments, privacy should be viewed as a controllable source of total survey error, an approach that is again made feasible by the transparency of the privatization procedure. We return to this subject in Section 5.
The misleading analysis presented in Section 3 is not the fault of differential privacy, nor of linear regression or other means of statistical modeling. Rather, obscure privacy mechanisms prevent us from performing the right analysis. Any statistical model, however adequate in describing the probabilistic regularities in the confidential data, will generally be inadequate when naïvely applied to the privatized data.
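To make the point concrete, the following is a minimal simulation sketch, not a reproduction of the study behind Figure 1. It assumes that both the covariate and the response are privatized with the additive Laplace mechanism, treats each value as having sensitivity 1 purely for illustration, and uses hypothetical settings (sample size, coefficients, a Poisson covariate, and a privacy loss budget of 1). The helper names `laplace_privatize` and `ols` are likewise illustrative.

```python
import numpy as np

def laplace_privatize(values, epsilon, sensitivity=1.0, rng=None):
    """Additive Laplace mechanism: noise scale is sensitivity / epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=np.shape(values))

def ols(x, y):
    """Ordinary least squares slope and intercept for simple linear regression."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return slope, np.mean(y) - slope * np.mean(x)

rng = np.random.default_rng(2022)
n, beta0, beta1, epsilon = 5000, -5.0, 4.0, 1.0  # hypothetical settings

# Confidential data from a simple linear model (assumed for illustration).
x = rng.poisson(lam=10, size=n).astype(float)
y = beta0 + beta1 * x + rng.normal(0.0, 5.0, size=n)

# Privatized versions released to the analyst.
x_dp = laplace_privatize(x, epsilon, rng=rng)
y_dp = laplace_privatize(y, epsilon, rng=rng)

slope_conf, intercept_conf = ols(x, y)
slope_naive, intercept_naive = ols(x_dp, y_dp)

# Classic measurement error attenuation factor: var(x) / (var(x) + noise variance),
# where a Laplace variable with scale b has variance 2 * b**2.
noise_var = 2.0 * (1.0 / epsilon) ** 2
attenuation = np.var(x, ddof=1) / (np.var(x, ddof=1) + noise_var)

print(f"confidential fit: slope={slope_conf:.3f}, intercept={intercept_conf:.3f}")
print(f"naive fit on privatized data: slope={slope_naive:.3f}, intercept={intercept_naive:.3f}")
print(f"slope predicted by attenuation: {beta1 * attenuation:.3f}")
```

Under these assumptions the naïve slope shrinks toward zero by roughly the attenuation factor, and the naïve intercept shifts to compensate, even though the confidential model itself is correctly specified.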
To correctly account for the privacy mechanism, statistical models designed for confidential data need to be augmented to include the additional layer of uncertainty due to privacy. In our example, the simple linear model of (3.1) is the true generative model for the confidential statistics
If a statistical model is of high quality, or more precisely self-efficient (Meng, 1994; Xie & Meng, 2017), its inference based on the privatized data should typically bear more uncertainty compared to that based on the confidential data. The increase in uncertainty is attributable to the privacy mechanism. Therefore, uncertainty quantification is of particular importance when it comes to analyzing privatized data. But drawing statistically valid inference from privatized data is not as simple as increasing the nominal coverage probability of confidence or credible regions from the old analysis. As we have seen, fitting the naïve linear model on differentially privatized data creates a ‘double whammy’ due to both a biased estimator and incorrectly quantified estimation uncertainty. The right analysis hinges on incorporating the probabilistic privacy mechanism into the model itself. This ensures that we capture uncertainty stemming from any potential systematic bias displayed by the estimator due to noise injection, as well as a sheer loss of precision due to diminished informativeness of the data.
For data users who currently employ analysis protocols designed without private data in mind, this suggests that modification needs to be made to their favorite tools. That sounds like an incredibly daunting task. However, on a conceptual level, what needs to be done is quite simple. We present a general recipe for the vast class of statistical methods with either a likelihood or a Bayesian justification.
When the privacy mechanism is transparent, we can write down the observed, or marginal, likelihood function for
with the notation
where the proportionality constant
The marginal likelihood for
Proof. The ‘if’ part of the theorem is trivial. For the ‘only if’ part, note that (4.4) is the same as the requirement of weak equivalence between the true posterior
where the proportionality constant
What Theorem 1 says is that, if we conceive the statistical validity of an analysis as its ability to yield the same expected answer as that implied by the correct model (that is, by properly accounting for the privatization mechanism) for a wide range of questions (reflected by the free choice of
As discussed in Section 1, traditional SDL techniques such as suppression, deidentification, and swapping rely fundamentally on procedural secrecy. While each of these methods admits a precise characterization
Transparent privacy mechanisms merit another important quality, namely parameter distinctiveness, or a priori parameter independence, from both the generative model of the true confidential data as well as any descriptive model the analyst wishes to impose on it. Parameter distinctiveness always holds since the entire privacy mechanism, all within control of the curator, is fully announced hence has no hidden dependence on the unknown inferential parameter
While conceptually simple, carrying through the correct calculation can be computationally demanding. The integral in (4.2) may easily become intractable if the statistical model is complex, if the confidential data is high-dimensional (as is the case with the census tabulations), or if a combination of both holds true. The challenge is amplified by the fact that the two components of the integral are generally not in conjugate forms. While the privacy mechanism
To answer the demand for statistically valid inference procedures based on privatized data, Gong (2019) discusses two sets of computational frameworks to handle independently and arbitrarily specified privacy mechanisms and statistical models. For exact likelihood inference, the integration in (4.2) can be performed using Monte Carlo expectation maximization (MCEM), designed for the presence of latent variables or partially missing data and equipped with a general-purpose importance sampling strategy at its core. Exact Bayesian inference according to (4.3) can be achieved with, somewhat surprisingly, an approximate Bayesian computation (ABC) algorithm. The tuning parameters of the ABC algorithm usually control the level of approximation in exchange for Monte Carlo efficiency, or computational feasibility in complex models. In the case of privacy, the tuning parameters are set to reflect the privacy mechanism, in such a way that the algorithm outputs exact draws from the desired Bayesian posterior for any proper prior specification. I have explained this phenomenon with a catchy phrase: approximate computation on exact data is exact computation on approximate data. Private data is approximate data, and its inexact nature can be leveraged to our benefit, if the privatization procedure becomes correctly aligned with the necessary approximation that brings computational feasibility.
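The idea that an ABC acceptance rule can be matched to the privacy mechanism can be sketched with a toy rejection sampler. This is an illustrative sketch under stated assumptions, not the algorithm as implemented in Gong (2019): the confidential statistic is a sample mean from a normal model with known standard deviation, the prior on the mean is normal and centered at zero, and the released statistic is the sample mean plus Laplace noise with a publicly known scale. The function name `exact_abc_posterior` and all numerical settings are hypothetical. Because each proposal is accepted with probability equal to the Laplace density at the observed release divided by its maximum, the accepted draws follow the posterior given the privatized statistic.

```python
import numpy as np

def exact_abc_posterior(s_dp, n, sigma, noise_scale, prior_sd, n_draws, rng, batch=20000):
    """Rejection sampler whose acceptance kernel is the Laplace privacy mechanism."""
    accepted, total = [], 0
    while total < n_draws:
        # Step 1: draw parameter values from the prior.
        theta = rng.normal(0.0, prior_sd, size=batch)
        # Step 2: simulate the confidential statistic (a sample mean) given theta.
        xbar = rng.normal(theta, sigma / np.sqrt(n))
        # Step 3: accept with probability p(s_dp | xbar) / max density,
        # which for Laplace noise equals exp(-|s_dp - xbar| / noise_scale).
        accept_prob = np.exp(-np.abs(s_dp - xbar) / noise_scale)
        keep = rng.uniform(size=batch) < accept_prob
        accepted.append(theta[keep])
        total += int(keep.sum())
    return np.concatenate(accepted)[:n_draws]

rng = np.random.default_rng(0)
draws = exact_abc_posterior(
    s_dp=1.0,         # privatized sample mean (hypothetical value)
    n=100,            # confidential sample size
    sigma=1.0,        # known data standard deviation
    noise_scale=0.5,  # public Laplace scale of the privacy mechanism
    prior_sd=10.0,    # prior standard deviation for the mean
    n_draws=2000,
    rng=rng,
)
print(f"posterior mean ~ {draws.mean():.3f}, posterior sd ~ {draws.std():.3f}")
```

In this sketch the usual ABC tolerance is replaced by the noise distribution itself, which is only possible because the mechanism and its scale are public.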
To continue the illustration with our running example, the MCEM algorithm is implemented to draw maximum likelihood inference for the
In introductory probability and survey sampling classrooms, the concept of a census is frequently invoked as a pedagogical reference, often with the U.S. Decennial Census as a prototype. The teacher would contrast statistical inference from a probabilistic sampling scheme with directly observing a quantity from the census, regarding the latter as the gold standard, if not the ground truth. This narrative may have left many quantitative researchers with the impression that the census is always comprehensive and accurate. The reality, however, invariably departs from this ideal. The census is a survey, and is subject to many kinds of errors and uncertainties, as are all surveys. Like coverage bias, nonresponse, and erroneous or edited inputs, statistical disclosure limitation introduces a source of uncertainty into the survey, albeit one that is unique in nature.
To assess the quality of the end data product, and to improve it to the extent possible, we construe privacy as one of the several interrelated contributors to total survey error (TSE; Groves, 2005). Errors due to privacy make up a source of nonsampling survey error (Biemer, 2010). Additive mechanisms create privacy errors that bear a structural resemblance to measurement errors (Reiter, 2019). What makes privacy errors easier to deal with than other sources of survey error, at least theoretically, is that their generative process is verifiable and manipulable. Under central models of differential privacy, the process is within the control of the curator, and under local models (i.e., the responses are privatized as they leave the respondent) it is defined by explicit protocols. Transparency brings several notable advantages to the game. Privacy errors are known to enjoy desirable properties such as simple and tractable probability distributions, statistical independence among the error terms, as well as between the errors and the underlying confidential data (i.e., parameter distinctiveness). These properties may be assumptions for measurement errors, but they are known to hold true for privacy errors. In the classic measurement error setting, the error variance needs to be estimated. In contrast, the theoretical variances of all the additive privacy mechanisms are known and public. The structural similarity between privacy errors and measurement errors allows for the straightforward adaptation of existing tools for measurement error modeling, including regression calibration and simulation extrapolation, which perform well for a wide class of generalized linear models. Other approaches that aim to remedy the effect of both missing data and measurement errors can be modified to include privacy errors (Blackwell et al., 2017a, 2017b; Kim et al., 2014; Kim et al., 2015).
Most recently, steps are being taken to develop methods for direct bias correction in the regression context (Evans & King, n.d.).
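The practical value of a known error variance can be illustrated with the textbook method-of-moments correction for attenuation, a simpler relative of regression calibration and simulation extrapolation rather than an implementation of either. The sketch below assumes Laplace noise of public scale added to both the covariate and the response, and uses hypothetical data-generating settings. The function `corrected_slope` is an illustrative name, not part of any cited method.

```python
import numpy as np

def corrected_slope(x_dp, y_dp, noise_var_x):
    """Method-of-moments slope: remove the known privacy noise variance from var(x_dp)."""
    s_xy = np.cov(x_dp, y_dp, ddof=1)[0, 1]
    s_xx = np.var(x_dp, ddof=1) - noise_var_x
    if s_xx <= 0:
        raise ValueError("Noise variance exceeds observed variance; correction is undefined.")
    return s_xy / s_xx

rng = np.random.default_rng(7)
n, beta0, beta1 = 5000, -5.0, 4.0   # hypothetical confidential model
laplace_scale = 1.0                 # public scale of the privacy noise (assumed)

x = rng.poisson(lam=10, size=n).astype(float)
y = beta0 + beta1 * x + rng.normal(0.0, 5.0, size=n)
x_dp = x + rng.laplace(0.0, laplace_scale, size=n)
y_dp = y + rng.laplace(0.0, laplace_scale, size=n)

# Naive slope ignores the mechanism; the corrected slope uses its known variance 2 * b**2.
naive = np.cov(x_dp, y_dp, ddof=1)[0, 1] / np.var(x_dp, ddof=1)
corrected = corrected_slope(x_dp, y_dp, noise_var_x=2.0 * laplace_scale**2)

print(f"true slope={beta1}, naive={naive:.3f}, corrected={corrected:.3f}")
```

No auxiliary validation data are needed to estimate the error variance here, which is the contrast with the classic measurement error setting drawn above. The correction addresses the point estimate only; valid interval estimates still require modeling the mechanism.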
Figure 2. 95% joint confidence regions for
We emphasize that the transparency of the privacy mechanism is crucial to the understanding, quantification, and control of its impact on the quality of the resulting data product from a total survey error approach. As noted in Karr (2017), traditional disclosure limitation methods often passively interact with other data-processing and error-reduction procedures commonly applied to surveys, and the effect of such interactions is often subtle. Due to the artificial nature of all privacy mechanisms, any interaction between the privacy errors can be explicitly investigated and quantified, either theoretically or via simulation, strengthening the quality of the end data product by taking out the guesswork. It is particularly convenient that the mathematical formulation of differential privacy employs the concept of a privacy loss budget, which acts as a fine-grained tuning parameter for the performance of the procedure. The framework is suited for integration with the total budget concept and the error decomposition approach to understanding the effect of individual error constituents. The price we pay for privacy can be regarded as a trade-off with the total utility, defined through concrete quality metrics on the resulting data product: for example, the minimal mean squared error achievable by an optimal survey design, or the accuracy on the output of certain routine data analysis protocols.
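As a small illustration of treating the privacy loss budget as a tuning parameter, the sketch below considers a single count query protected by the Laplace mechanism with sensitivity 1, which is an assumption made here for simplicity. For that mechanism the root mean squared error due to privacy is the square root of 2 times the sensitivity divided by the budget, so an accuracy target maps directly to a minimal budget. The function names and the target value are hypothetical.

```python
import math
import numpy as np

def laplace_rmse(epsilon, sensitivity=1.0):
    """Theoretical RMSE of the Laplace mechanism: sqrt(2) * sensitivity / epsilon."""
    return math.sqrt(2.0) * sensitivity / epsilon

def smallest_epsilon_for_target(target_rmse, sensitivity=1.0):
    """Smallest privacy loss budget whose privacy-induced RMSE meets the target."""
    return math.sqrt(2.0) * sensitivity / target_rmse

def simulated_rmse(epsilon, sensitivity=1.0, n_sim=200_000, seed=0):
    """Monte Carlo check of the privacy error for one count query."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(0.0, sensitivity / epsilon, size=n_sim)
    return float(np.sqrt(np.mean(noise**2)))

for eps in (0.1, 0.5, 1.0, 2.0):
    print(f"epsilon={eps}: theory={laplace_rmse(eps):.3f}, simulated={simulated_rmse(eps):.3f}")

target = 3.0  # hypothetical accuracy target, in units of persons
print(f"budget needed for RMSE <= {target}: {smallest_epsilon_for_target(target):.3f}")
```

A calculation of this kind is available at the design stage only because the mechanism is public. For complex statistics the closed form would typically give way to simulation.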
An increase in the PLB will in general improve the quality of the data product. But the impact on data quality exerted by a particular choice of PLB should be understood within the specific context of application. When the important use cases and accuracy targets are identified, transparency allows for the setting of privacy parameters to meet these targets via theoretical or simulated explorations, as early as during the design phase of the survey. As an illustration, Figure 2 repeats the same regression analysis as in Figure 1, but with
When privatization is a transparent procedure, it does not merely add to the total error of an otherwise confidential survey. We have reasons to hope that it may help reduce the error via means of human psychology. A primary cause of inaccuracy in the census is nonresponse and imperfect coverage, in part having to do with insufficient public trust, both in the privacy protection of disseminated data products and in the Census Bureau’s ability to maintain confidentiality of sensitive information (boyd & Sarathy, 2022; Singer et al., 1993; Sullivan, 2020). Individual data contributors value their privacy. Through their data sharing (or rather, un-sharing) decisions, they exhibit a clear preference for privacy, which has been both theoretically studied (Ghosh & Roth, 2015; Nissim et al., 2012) and empirically measured (Acquisti et al., 2013). To the privacy-conscious data contributor, transparent privacy offers the certainty of knowing that their information is protected in an explicit and provable way that is vetted by communities of interested data users. In addition, transparent privacy enables a quantitative description of how the information from each data contributor supports fair and accurate policy decisions, which directly affect the welfare of individual respondents. Even small progress toward instilling confidence and encouraging participation can reduce the potentially immense cost due to systematic nonresponse bias, and enhance the quality of the survey (Meng, 2018).
The algorithmic construction of differential privacy and the theoretical explorations of total survey error create a promising intersection. We hope to see synergistic methodological developments to serve the dual purpose of efficient privacy protection and survey quality optimization. I will briefly discuss one such direction. Discussing TSE-aware SDL, Karr (2017) advocates that when additive privacy mechanisms are employed, the optimal choice of privacy error covariance should accord with the measurement error covariance. The resulting data release demonstrates superior utility in terms of closeness to the confidential data distribution in the sense of minimal Kullback-Leibler divergence. This proposal, when accepted into the differential privacy framework, requires generalizing the vanilla algorithms to produce correlated noise while preserving the privacy guarantee. Differential privacy researchers have looked in this direction and offered tools adaptable to this purpose. For example, Nikolov et al. (2013) propose a correlated Gaussian mechanism for linear queries, and demonstrate that it is an optimal mechanism among
Just as some gifts are more practical than others, some versions of transparent privacy are more usable than others. An example of transparent privacy that can be difficult to work with occurs when constraints (including invariants, nonnegativity, integer characteristics, and structural consistencies) must be simultaneously imposed on the differentially private queries.
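A toy example shows why even a single constraint changes the statistical behavior of a privatized count. The sketch below adds double geometric noise to a small true count and then enforces nonnegativity by clamping at zero. The clamp is a deliberately crude stand-in chosen for illustration and is not the optimization used in the TopDown Algorithm. The true count, the privacy loss budget, and the function name `double_geometric_noise` are hypothetical, and a sensitivity of 1 is assumed.

```python
import numpy as np

def double_geometric_noise(epsilon, size, rng):
    """Integer noise with P(k) proportional to exp(-epsilon * |k|),
    generated as the difference of two i.i.d. geometric variables."""
    p = 1.0 - np.exp(-epsilon)
    return rng.geometric(p, size=size) - rng.geometric(p, size=size)

rng = np.random.default_rng(11)
true_count, epsilon, n_rep = 2, 0.5, 200_000  # hypothetical small-cell setting

# Noisy measurements: unbiased for the true count, but possibly negative.
noisy = true_count + double_geometric_noise(epsilon, n_rep, rng)

# Constraint-compliant release: negative values are replaced by zero.
clamped = np.maximum(noisy, 0)

print(f"mean of noisy measurements: {noisy.mean():.3f}")
print(f"mean after nonnegativity clamp: {clamped.mean():.3f}")
print(f"share of replicates altered by the clamp: {(noisy < 0).mean():.3f}")
```

The noisy measurements average to the true count, while the clamped values are biased upward, and the size of that bias depends on the confidential count itself. This data dependence is what makes the error of a constrained release harder to model than the error of the underlying noisy measurement.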
Invariants are a set of exact statistics calculated based on the confidential microdata (Abowd et al., 2022; Ashmead et al., 2019). Some invariants are mandated, in the sense that all versions of the privatized data that the curator can release must accord to these values. Invariants represent use cases for which a precise enumeration is crucial. For example, the total population of each state, which serves as the basis for the allocation of House seats, must be reported exactly as enumerated as required by the U.S. Constitution.
What information is deemed invariant, and what characteristics of the confidential data should form constraints on the privatized data, is ultimately a policy decision. However, constraints don’t mingle with classical differential privacy in a straightforward manner. Indeed, if a query has unbiased random noise added to it, there is no guarantee that it still possesses the same characteristics as does the noiseless version. The task of ensuring that privatized census data releases are constraint-compliant is performed by the TopDown Algorithm (Abowd et al., 2022). The algorithm consists of two phases. During the measurement phase, differentially private noisy measurements, which are counts infused with unbiased discrete Gaussian noise, are generated for each geographic level. During the estimation phase, the algorithm employs nonnegative
Due to the sheer size of the optimization problem, the statistical properties of its output do not succumb easily to theoretical explorations. However, the observed adverse effects of such processing should not strike us as unanticipated. Projective optimizations, be they
Note that an optimization algorithm that imposes invariants can still be procedurally transparent. The design of the TopDown Algorithm is documented in the Census Bureau’s publication (Abowd et al., 2022), accompanied by a suite of demonstration products and the GitHub codebase (2020 Census DAS Development Team, 2021). However, mere procedural transparency may not be good enough. In summary of the NASEM CNSTAT workshop dedicated to the assessment of the 2020 Census DAS, Hotz and Salvo (2022) note that postprocessing of privatized data can be particularly difficult to model statistically. This is because the optimization imposes an extremely complex, indeed data-dependent, function to the confidential data (Gong & Meng, 2020). As a result, the distributional description of the overall algorithm (including postprocessing), denoted as
Nevertheless, procedural transparency is a promising step toward the full transparency that is needed to support principled statistical inference. Through the design phase of the 2020 DAS for the P.L. 94-171 data products, the Census Bureau released a total of six rounds of demonstration data files in the form of privacy-protected microdata files (PPMFs). The PPMFs enabled community members to assess the DAS performance, including its accuracy targets, and to provide feedback to the Census Bureau for future improvement. These demonstration data are a crucial source of information for the data-user communities, and have supported research on the impact of differential privacy as well as postprocessing in topics such as small area population (Swanson et al., 2021; Swanson & Cossman, 2021), tribal nations (National Congress of American Indians, 2021), and redistricting and voting rights measures (Cohen et al., 2022; Kenny et al., 2021).
On August 12, 2021, a group of privacy researchers signed a letter addressed to Dr. Ron Jarmin, Acting Director of the United States Census Bureau, to request the release of the noisy measurement files that accompanied the P.L. 94-171 redistricting data products (Dwork et al., 2021). The letter made the compelling case that the noisy measurement files present the most straightforward solution to the issues that arise due to postprocessing. Since the noisy measurements are already formally private, releasing these files does not pose an additional threat to the privacy guarantee that the Bureau already offers. On the other hand, they will allow researchers to quantify the biases induced by postprocessing and to conduct correct uncertainty quantification. In the report Consistency of Data Products and Formal Privacy Methods for the 2020 Census, JASON (2022, p. 8) makes the recommendation that the Bureau “should not reduce the information value of their data products solely because of fears that some stakeholders will be confused by or misuse the released data.” It makes an explicit call for the release of all noisy measurements used to produce the released data products that do not unduly increase disclosure risk, and the quantification of uncertainty associated with the publicized data products. On April 28–29, 2022, a workshop dedicated to articulating a technical research agenda for statistical inference on the differentially private census noisy measurement files took place at Rutgers University, gathering experts from the social sciences, demography, public policy, statistics, and computer science. These efforts reflect the shared recognition among the research and policy communities that access to the census noisy measurement files, and the associated transparency benefits, are both crucial and feasible within the current disclosure avoidance framework that the Census Bureau employs.
The evolution of privacy science over the years reflects the growing dynamic among several branches of data science, as they collectively benefit from vastly improved computational and data storage abilities. What we’re witnessing today is a paradigm shift in the science of curating official, social, and personal statistics. A change of this scale is bound to exert seismic impact on the ways that quantitative evidence is used and interpreted, raising novel questions and opportunities in all disciplines that rely on these data sources. The protection of privacy is not just a legal or policy mandate, but an ethical treatment of all individuals who contribute to the collective betterment of science and society with their information. As privacy research continues to evolve, an open and cross-disciplinary conversation is the catalyst to a fitting solution. Partaking in this conversation is our opportunity to defend democracy in its modern form: underpinned by numbers, yet elevated by our respect for one another as more than just numbers.
Ruobin Gong wishes to thank Xiao-Li Meng for helpful discussions, and five anonymous reviewers for their comments.
Ruobin Gong’s research is supported in part by the National Science Foundation (DMS-1916002).
2020 Census DAS Development Team. (2021). DAS 2020 redistricting production code release [Accessed: 05-31-2022]. https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code
Abowd, J. M. (2019). Staring down the database reconstruction theorem. https://www2.census.gov/programs-surveys/decennial/2020/resources/presentations-publications/2019-02-16-abowd-db-reconstruction.pdf
Abowd, J. M., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., & Zhuravlev, P. (2022). The 2020 Census disclosure avoidance system TopDown Algorithm. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.529e3cb9
Abowd, J. M., & Schmutte, I. M. (2016). Economic analysis and statistical disclosure limitation. Brookings Papers on Economic Activity, 2015(1), 221–293. https://www.brookings.edu/wp-content/uploads/2015/03/AbowdText.pdf
Acquisti, A., John, L. K., & Loewenstein, G. (2013). What is privacy worth? The Journal of Legal Studies, 42(2), 249–274. https://doi.org/10.1086/671754
Ashmead, R., Kifer, D., Leclerc, P., Machanavajjhala, A., & Sexton, W. (2019). Effective privacy after adjusting for invariants with applications to the 2020 census (tech. rep.). https://github.com/uscensusbureau/census2020-das-e2e/blob/d9faabf3de987b890a5079b914f5aba597215b14/doc/20190711_0941_Effective_Privacy_after_Adjusting_for_Constraints__With_applications_to_the_2020_Census.pdf
Asquith, B., Hershbein, B., Kugler, T., Reed, S., Ruggles, S., Schroeder, J., Yesiltepe, S., & Van Riper, D. (2022). Assessing the impact of differential privacy on measures of population and racial residential segregation. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.5cd8024e
Avella-Medina, M. (2021). Privacy-preserving parametric inference: A case for robust statistics. Journal of the American Statistical Association, 116(534), 969–983. https://doi.org/10.1080/01621459.2019.1700130
Barrientos, A. F., Williams, A. R., Snoke, J., & Bowen, C. (2021). Differentially private methods for validation servers: A feasibility study on administrative tax data (tech. rep.). Urban Institute.
Berger, J. O. (1990). On the inadmissibility of unbiased estimators. Statistics & Probability Letters, 9(5), 381–384. https://doi.org/10.1016/0167-7152(90)90028-6
Biemer, P. P. (2010). Total survey error: Design, implementation, and evaluation. Public Opinion Quarterly, 74(5), 817–848. https://doi.org/10.1093/poq/nfq058
Blackwell, M., Honaker, J., & King, G. (2017a). A unified approach to measurement error and missing data: Details and extensions. Sociological Methods & Research, 46(3), 342–369. https://doi.org/10.1177/0049124115589052
Blackwell, M., Honaker, J., & King, G. (2017b). A unified approach to measurement error and missing data: Overview and applications. Sociological Methods & Research, 46(3), 303–341. https://doi.org/10.1177/0049124115585360
boyd, d., & Sarathy, J. (2022). Differential perspectives: Epistemic disconnects surrounding the U.S. Census Bureau’s use of differential privacy. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.66882f0e
Bun, M., & Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In M. Hirt & A. Smith (Eds.), Theory of cryptography (pp. 635–658). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-53641-4_24
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective. Chapman-Hall/CRC.
Cohen, A., Duchin, M., Matthews, J., & Suwal, B. (2022). Private numbers in public policy: Census, differential privacy, and redistricting. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.22fd8a0e
Devine, J., Borman, C., & Spence, M. (2020). 2020 Census disclosure avoidance improvement metrics. https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/disclosure-avoidance-system/2020-03-18-2020-census-da-improvement-metrics.pdf
Dimitrakakis, C., Nelson, B., Mitrokotsa, A., & Rubinstein, B. I. P. (2014). Robust and private Bayesian inference. In P. Auer, A. Clark, T. Zeugmann, & S. Zilles (Eds.), Algorithmic learning theory (pp. 291–305). Springer International Publishing. https://doi.org/10.1007/978-3-319-11662-4_21
Dinur, I., & Nissim, K. (2003). Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 202–210). ACM. https://doi.org/10.1145/773153.773173
Dwork, C., King, G., Greenwood, R., Adler, W. T., Alvarez, J., Ballesteros, M., Beck, N., Bouk, D., boyd, d., Brehm, J., Bun, M., Cohen, A., Cook, C., Desfontaines, D., Evans, G., Flaxman, A. D., Franzeses, R. J., Gaboardi, M., Geambasu, R., . . . Zhang, L. (2021). Request for release of “noisy measurements file” by September 30 along with redistricting data products [Letter to Dr. Ron Jarmin, Acting Director, United States Census Bureau, Aug. 12, 2021].
Dwork, C., & Lei, J. (2009). Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing (pp. 371–380). https://doi.org/10.1145/1536414.1536466
Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In S. Halevi & T. Rabin (Eds.), Lecture notes in computer science: Vol. 3876. Theory of cryptography (pp. 265–284). Springer. https://doi.org/10.1007/11681878_14
Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211–407. https://doi.org/10.1561/0400000042
Dwork, C., & Rothblum, G. N. (2016). Concentrated differential privacy. arXiv. https://doi.org/10.48550/arXiv.1603.01887
Evans, G., & King, G. (in press). Statistically valid inferences from differentially private data releases, with application to the Facebook URLs dataset. Political Analysis.
Fioretto, F., Van Hentenryck, P., & Zhu, K. (2021). Differential privacy of hierarchical census data: An optimization approach. Artificial Intelligence, 296, 103475. https://doi.org/10.1016/j.artint.2021.103475
Ghosh, A., & Roth, A. (2015). Selling privacy at auction. Games and Economic Behavior, 91, 334–346. https://doi.org/10.1016/j.geb.2013.06.013
Ghosh, A., Roughgarden, T., & Sundararajan, M. (2012). Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing, 41(6), 1673–1693. https://doi.org/10.1137/09076828X
Gong, R. (2019). Exact inference with approximate computation for differentially private data via perturbations. arXiv. https://doi.org/10.48550/arXiv.1909.12237
Gong, R., & Meng, X.-L. (2020). Congenial differential privacy under mandated disclosure. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference (pp. 59–70). https://doi.org/10.1145/3412815.3416892
Groves, R. M. (2005). Survey errors and survey costs (Vol. 581). John Wiley & Sons.
Hawes, M. B. (2020). Implementing differential privacy: Seven lessons from the U.S. 2020 Census. Harvard Data Science Review, 2(2). https://doi.org/10.1162/99608f92.353c6f99
Hotz, V. J., & Salvo, J. (2022). A chronicle of the application of differential privacy to the 2020 Census. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.ff891fe5
Iceland, J., Weinberg, D. H., & Steinmetz, E. (2002). Racial and ethnic residential segregation in the United States: 1980–2000. Census 2000 Special Reports. https://www.census.gov/content/dam/Census/library/publications/2002/dec/censr-3.pdf
JASON. (2022). Consistency of data products and formal privacy methods for the 2020 Census. https://www2.census.gov/programs-surveys/decennial/2020/program-management/planning-docs/2020-census-data-products-privacy-methods.pdf
Karr, A. F. (2017). The role of statistical disclosure limitation in total survey error. In P. P. Biemer, E. D. de Leeuw, S. Eckman, B. Edwards, F. Kreuter, L. E. Lyberg, N. C. Tucker, & B. T. West (Eds.), Total survey error in practice (pp. 71–94). John Wiley & Sons. https://doi.org/10.1002/9781119041702.ch4
Karr, A. F., & Reiter, J. (2014). Using statistics to protect privacy. Privacy, big data, and the public good: Frameworks for engagement. Cambridge University Press.
Kenny, C. T., Kuriwaki, S., McCartan, C., Rosenman, E. T., Simko, T., & Imai, K. (2021). The use of differential privacy for census data and its impact on redistricting: The case of the 2020 US Census. Science Advances, 7(41), Article eabk3283. https://doi.org/10.1126/sciadv.abk3283
Kifer, D., Smith, A., & Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. In S. Mannor, N. Srebro, & R. C. Williamson (Eds.), Proceedings of the 25th annual conference on learning theory (pp. 25.1–25.40). PMLR. https://proceedings.mlr.press/v23/kifer12.html
Kim, H. J., Cox, L. H., Karr, A. F., Reiter, J. P., & Wang, Q. (2015). Simultaneous edit-imputation for continuous microdata. Journal of the American Statistical Association, 110(511), 987–999. https://doi.org/10.1080/01621459.2015.1040881
Kim, H. J., Reiter, J. P., Wang, Q., Cox, L. H., & Karr, A. F. (2014). Multiple imputation of missing or faulty values under linear constraints. Journal of Business & Economic Statistics, 32(3), 375–386. https://doi.org/10.1080/07350015.2014.885435
Little, R., & Rubin, D. (2014). Statistical analysis with missing data. Wiley.
McKenna, L. (2018). Disclosure avoidance techniques used for the 1970 through 2010 Decennial Censuses of Population and Housing (tech. rep.). U.S. Census Bureau. https://www.census.gov/content/dam/Census/library/working-papers/2018/adrm/Disclosure%20Avoidance%20Techniques%20for%20the%201970-2010%20Censuses.pdf
McSherry, F., & Talwar, K. (2007). Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07) (pp. 94–103). https://doi.org/10.1109/FOCS.2007.66
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9(4), 538–558. https://doi.org/10.1214/ss/1177010269
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF
Nikolov, A., Talwar, K., & Zhang, L. (2013). The geometry of differential privacy: The sparse and approximate cases. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (pp. 351–360). https://doi.org/10.1145/2488608.2488652
Nissim, K., Orlandi, C., & Smorodinsky, R. (2012). Privacy-aware mechanism design. In Proceedings of the 13th ACM Conference on Electronic Commerce (pp. 774–789). https://doi.org/10.1145/2229012.2229073
Oberski, D. L., & Kreuter, F. (2020). Differential privacy and social science: An urgent puzzle. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.63a22079
National Congress of American Indians. (2021). 2020 Census Disclosure Avoidance System: Potential impacts on tribal nation census data (tech. rep.).
Oganian, A., & Karr, A. F. (2006). Combinations of SDC methods for microdata protection. In J. Domingo-Ferrer & L. Franconi (Eds.), Lecture notes in computer science: Vol. 4302. Privacy in statistical databases (pp. 102–113). Springer Berlin Heidelberg. https://doi.org/10.1007/11930242_10
Raghunathan, T. E., Reiter, J. P., & Rubin, D. B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), 1–16.
Reimherr, M., & Awan, J. (2019). KNG: The k-norm gradient mechanism. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/faefec47428cf9a2f0875ba9c2042a81-Paper.pdf
Reiter, J. P. (2019). Differential privacy and federal data releases. Annual Review of Statistics and Its Application, 6, 85–101. https://doi.org/10.1146/annurev-statistics-030718-105142
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581
Rubin, D. B. (1993). Satisfying confidentiality constraints through the use of synthetic multiply imputed microdata. Journal of Official Statistics, 9(2), 461–468.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Singer, E., Mathiowetz, N. A., & Couper, M. P. (1993). The impact of privacy and confidentiality concerns on survey participation: The case of the 1990 US Census. Public Opinion Quarterly, 57(4), 465–482. https://doi.org/10.1086/269391
Stigler, S. M. (2016). The seven pillars of statistical wisdom. Harvard University Press.
Sullivan, T. A. (2020). Coming to our census: How social statistics underpin our democracy (and republic). Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.addb8baf
Swanson, D. A., Bryan, T. M., & Sewell, R. (2021). The effect of the differential privacy disclosure avoidance system proposed by the Census Bureau on 2020 Census products: Four case studies of census blocks in Alaska. https://www.populationassociation.org/blogs/paa-web1/2021/03/30/the-effect-of-the-differential-privacy-disclosure
Swanson, D. A., & Cossman, R. E. (2021). The effect of the differential privacy disclosure avoidance system proposed by the Census Bureau on 2020 Census products: Four case studies of census blocks in Mississippi. https://www.ncsl.org/Portals/1/Documents/Elections/Four_Case_Studies_of_Census_Blocks_in_Mississippi.pdf
Sweeney, L. (2002). K-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570. https://doi.org/10.1142/S0218488502001648
U.S. Census Bureau. (2019). Memorandum 2019.25: 2010 demonstration data products - design parameters and global privacy-loss budget. https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/plan/memo-series/2020-memo-2019_25.html
U.S. Census Bureau. (2020a). 2010 demonstration data products. https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/2010-demonstration-data-products.html
U.S. Census Bureau. (2020b). 2020 disclosure avoidance system updates. https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/2020-das-updates.html
U.S. Census Bureau. (2021a). 2020 Census: Redistricting file (Public Law 94-171) dataset. https://www.census.gov/data/datasets/2020/dec/2020-census-redistricting-summary-file-dataset.html
U.S. Census Bureau. (2021b). Census Bureau sets key parameters to protect privacy in 2020 Census results. https://www.census.gov/newsroom/press-releases/2021/2020-census-key-parameters.html
U.S. Census Bureau. (2021c). Privacy-loss budget allocation 2021-06-08. https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/01-Redistricting_File--PL_94-171/2021-06-08_ppmf_Production_Settings/2021-06-08-privacy-loss_budgetallocation.pdf
Van Riper, D., Kugler, T., & Schroeder, J. (2020). IPUMS NHGIS privacy-protected 2010 Census demonstration data. IPUMS. https://www.nhgis.org/privacy-protected-2010-census-demonstration-data
Warner, S. L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), 63–69.
Williams, O., & McSherry, F. (2010). Probabilistic inference and differential privacy. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2010/file/fb60d411a5c5b72b2e7d3527cfc84fd0-Paper.pdf
Xie, X., & Meng, X.-L. (2017). Dissecting multiple imputation from a multi-phase inference perspective: What happens when God’s, imputer’s and analyst’s models are uncongenial? Statistica Sinica, 27, 1485–1545. https://doi.org/10.5705/ss.2014.067
Zhu, K., Van Hentenryck, P., & Fioretto, F. (2021). Bias and variance of post-processing in differential privacy. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 11177–11184. https://ojs.aaai.org/index.php/AAAI/article/view/17333
Here we state a central limit theorem for the naïve slope estimator
The biasing coefficient
Figure A.1. Biasing effect of privacy noise in linear regression. Left: large sample 95% distribution limits of the naïve slope estimator
setting, and meets its nominal coverage for all
We now supply the proof of Theorem 2, which gives a large sample approximation to the distribution of the naïve regression slope estimator for privatized data, which takes the form of
Using independence between
We have that
noting that for each