Harnessing the Known Unknowns: Differential Privacy and the 2020 Census



Introduction
This special issue, Differential Privacy for the 2020 U.S. Census: Can We Make Data Both Private and Useful?, provides an entry point to help data scientists across many disciplines adjust to a big change in a key component of our national data infrastructure. The United States Census Bureau is adopting formal differential privacy protections for public products from the 2020 U.S. Decennial Census. This is the first time that a country has released most of its subpopulation counts with formal privacy protections, although certainly not the first time that other official counts have been perturbed for the purpose of disclosure avoidance.
Population censuses are important. Indeed, they may be the oldest statistical products of communal societies.
They are mentioned in the Bible (the book of Numbers) and required by Article I, Section 2 of the U.S. Constitution. Furthermore, a number of active and influential research communities depend upon decennial census data products.
Privacy protection for respondents is also important and getting more difficult to achieve. Such protection has long been required by law, in order to prevent harm and to encourage full and honest responses. Recently, though, growing uses of the decennial census, availability of other data sources, and increased computational firepower make protecting the privacy of census respondents more difficult.
Fortunately, newly developed formal privacy protection systems can both measure the degree of privacy protection and allow adequate transparency to inform statistical inference on protected data. Previous statistical methods used to protect privacy (such as suppression and swapping observations) lack both of these desirable properties.
Nevertheless, adopting a new form of privacy protection for such important data is far from easy. Some of the key challenges include implementation issues confronted by the Census Bureau, understanding analytical implications for data scientists, and managing communication so that all stakeholders can engage effectively with each other and inform the public about the implications of the change.

History of This HDSR Special Issue
After extensive research on the success of reconstruction attacks on the 2010 Decennial Census data and privacy protection systems and deliberation by its Data Stewardship Executive Policy Committee, the Census Bureau announced its intention to adopt formal differential privacy (DP) protections for the 2020 Decennial Census in 2018 (Abowd, 2018; U.S. Census Bureau, 2018). In 2019, the bureau began releasing files to help data users adjust and provide input to the changes. Two releases are of particular interest for this volume: (1) DP-protected data from the 1940 Decennial Census to accompany unprotected microdata files that were already in the public domain, and (2) DP-protected demonstration files using 2010 Decennial Census data.
Recognizing the importance of this change for many branches of data science, John Eltinge and Erica L.
Groshen (both HDSR co-editors at the time) and HDSR Editor-in-Chief Xiao-Li Meng decided to dedicate a session in the HDSR inaugural symposium to this topic. The goals were to sponsor research and produce an HDSR special issue that would identify implications of the new protections for research and provide a basis for constructive interdisciplinary dialogue among data users and with the Census Bureau. Three research teams agreed to independently analyze the protected and unprotected 1940 Census data in ways similar to what they expected to do with 2020 Census data. A diverse set of discussants also agreed to provide their views on the Census Bureau's approach and the results. And the Census Bureau provided its experts to summarize its efforts to date and plans going forward. This very lively Differential Privacy for the 2020 Census session took place on October 25, 2019.
Since then, the Census Bureau's plans have evolved. In particular, the Bureau made numerous changes to optimize the performance of their DP algorithms and decided to inject less noise than it did in the 1940 and 2010 demonstration data products. In the meantime, plans for this special issue also evolved. In particular, the issue has been enriched substantially by contributions after the conference. Two of the 1940 Census analytics teams found they were doing similar but complementary analyses, so they combined their efforts into a single paper that became Asquith et al. (2022). Some conference discussants chose to contribute papers rather than short discussions. HDSR published closely related papers that we felt merited republication in the print version of this special issue. The Census Bureau contributed three substantive background papers that will serve as key references on this topic going forward. We also solicited a couple of additional articles that add valuable analysis and insights to the issue.

Goals of This HDSR Special Issue
We aim for this special issue to help document, contextualize, and assess the Census Bureau's adoption of differential privacy and the debates around this decision. In doing so, the special issue will serve to inform stakeholders as they try to make use of the 2020 Census data products and understand the impact of differential privacy on such uses. This issue will also provide a reference point for future applications of differential privacy, both within and outside official statistics, recording challenges encountered and solutions found so that they do not need to be rediscovered. We also hope that the diverse readership and contributors to our special issue (and HDSR in general) will help to spread (both within and across disciplines) important insights and information about this change.
Concretely, the articles in this special issue address three central questions: Why and how did the Census Bureau adopt differential privacy? The adoption of differential privacy needs to be understood within the history of statistical disclosure control for census data and the bureau's competing obligations to provide useful official statistics and ensure confidentiality for respondents. Moreover, differential privacy is not a single method to simply be deployed. Rather, it is a framework for measuring the privacy protections of a disclosure avoidance system (DAS), and there are many choices for how to implement a DAS while limiting the privacy loss under differential privacy. Thus, the Census Bureau engaged in a multiyear iterative process of releasing demonstration data products (such as the 1940 data that was the subject of the October 2019 HDSR session that was the seed for this special issue), soliciting feedback from stakeholders, and modifying the differentially private algorithms before settling on the final DAS used to release the first of the 2020 Census data products (namely, the P.L. 94-171 Redistricting Data Summary File).
Several articles in this special issue trace the Census Bureau's adoption of differential privacy from distinct perspectives. Three of these articles are written by experts at the Census Bureau, reflecting different aspects of the government's work. Abowd et al. (2022) present both the rationale for and the design of the differentially private TopDown Algorithm (TDA) used to produce the 2020 Census Redistricting Data (U.S. Census Bureau, 2021). Hawes (2020) discusses the challenges and lessons that the Bureau has learned through its process of implementing differential privacy. Eltinge (2022) deliberates on the theoretical considerations in balancing privacy constraints and data quality from the perspective of official statistics agencies. Additionally, Sullivan (2020) underscores the pressing need to protect privacy as a defense of public trust and of the quality of the census.
Will the released differentially private data be fit for use? All of the methods for ensuring differential privacy involve introducing 'random noise' into the calculation of statistics so as to hide the contribution of any individual respondent. Differential privacy provides an accounting framework to measure and control the cumulative privacy loss incurred over all of the statistics calculated and published based on the census. Thus, differential privacy exposes two inherent tensions. One is the privacy-accuracy trade-off; providing a greater level of privacy requires introducing more noise, which leads to less accuracy. The other is the choice of which statistics (and, thus, which uses) to prioritize for accuracy; making some statistics more accurate requires making others less accurate in order to maintain the same level of privacy for respondents. Previous disclosure avoidance methods, such as data swapping, also had privacy-accuracy trade-offs and prioritized some queries over others, but those methods had far less transparency and opportunity for public input. Hence, it was and remains crucial to evaluate the fitness for use of data produced by the differentially private algorithms employed, both to inform the development of the DAS itself and to inform data users about data quality.
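To make the mechanics concrete, here is a minimal sketch of the canonical Laplace mechanism and of sequential budget composition. This illustrates the general framework only, not the Census Bureau's TopDown Algorithm, and the function names (`laplace_noise`, `private_count`) are ours:

```python
import random


def laplace_noise(scale, rng=random):
    # The difference of two unit exponentials is Laplace(0, 1);
    # multiplying by `scale` gives Laplace(0, scale).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with epsilon-DP via the Laplace mechanism.

    Adding or removing one respondent changes a count by at most 1
    (sensitivity = 1), so noise is drawn at scale sensitivity/epsilon.
    """
    return true_count + laplace_noise(sensitivity / epsilon)


# Sequential composition: publishing k statistics at epsilon_i each
# consumes sum(epsilon_i) of the global privacy-loss budget.
global_budget = 1.0
per_query_eps = global_budget / 4  # e.g., split evenly across 4 queries
noisy_counts = [private_count(1234, per_query_eps) for _ in range(4)]
```

Smaller per-query budgets mean larger noise scales, which is exactly the source of both tensions described above: more privacy costs accuracy, and prioritizing one statistic's accuracy leaves less budget for the others.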
To this end, as invited speakers at the 2019 HDSR session, Asquith et al. (2022) and Brummet et al. (2022) present assessments of the quality of the DAS demonstration data in various use cases, including survey sampling, federal funding allocation, and measurements of residential segregation. Cohen et al. (2022) investigate the impact of the TopDown Algorithm on the ability to accurately perform the analyses needed for elections and redistricting. One discussant at the conference, Heffetz (2022), highlights the multifaceted nature of the privacy-accuracy trade-off and calls for public discourse on its social and ethical interpretations. A second discussant, Gong (2022), underscores the importance of transparency of privacy mechanisms for drawing reliable statistical inferences.

What was the debate about and how do we move forward? The Bureau's decision to adopt differential privacy was met by resistance, often heated, from several data-user communities. A range of factors likely contributed to the tension. Some of the debate was the result of trade-offs that were not due to adopting differential privacy, but rather were exposed by it. As mentioned above, differential privacy makes explicit the trade-offs between privacy and accuracy, and between the accuracy for different statistics and use cases.
Stakeholders who were previously accustomed to treating census data as if it were ground truth, with the disclosure avoidance and other sources of error hidden from view, are now forced to negotiate with the Bureau and each other for accuracy. Given the vast range of uses of census data (including redrawing voting districts for legislative elections, resource allocation, social science research, and policy and private sector analysis) and different priorities placed on these uses and on privacy by different stakeholders, satisfying everyone would be impossible. There is not a unique 'optimal' solution, but rather a range of possibilities that require a policy decision to select among. Thus, it is natural for academic research papers to reach different conclusions depending on what is being measured or evaluated; at the same time, we need to be cautious about political actors exploiting the legitimate discourse for their own gain.
In their capacity as co-chairs to the National Academies of Sciences, Engineering, and Medicine (NASEM) Committee on National Statistics (CNSTAT) Workshop and Expert Meetings on Census Data Quality, Hotz and Salvo (2022) chronicle the development of the Bureau's disclosure avoidance technology over the years, and the various challenges pertaining to the differential privacy revolution in 2020. Drawing from ethnographic fieldwork and theories from science and technology studies, boyd and Sarathy (2022) present a parallel story from the perspective of a variety of stakeholders. Oberski and Kreuter (2020) discuss the scholarly interactions between differential privacy technologies and social scientific insights. Groshen and Goroff (2022) provide essential information and recommendations to social scientists as they approach the analysis of privacy-protected 2020 Decennial Census data.

How the Articles Fit Together
Each of the articles published in this special issue comes from a particular standpoint and addresses at least one of the above questions, and often more than one. We note also that each article reflects a specific moment in time, as both the public discourse and Census DAS algorithm have evolved between the October 2019 HDSR session and the present, and some of the articles were completed long before the publication of this special issue.

Empirical Evaluations
The following articles focus mostly on fitness for use, but in so doing, they inform questions about how differential privacy was implemented and the debate over its impact. They analyze how the additional noise added by privacy protection might affect standard uses of 2020 Decennial Census data. The first two articles (Asquith et al. [2022] and Brummet et al. [2022]) are based on the 1940 Census data created by the team at IPUMS (Integrated Public Use Microdata Series) using the Census Bureau's software and the demonstration products released by the Bureau in 2019. (Importantly, readers should note that for both the 1940 and 2010 demonstration data products, the Census Bureau injected notably more noise than it subsequently injected into 2020 Decennial Census data products. In addition, these evaluations only assess the apparent fitness for use of privatized data, because the analyses were conducted on the 1940 demonstration data as is, without statistically accounting for the privacy protections that had been applied.) Asquith et al. (2022) assess the absolute and relative accuracy of population counts, in total and by race, at multiple geographic levels, and compare commonly used measures of residential segregation, showing how accuracy varies with the global privacy loss budget and with its allocation across geographic levels and queries. Brummet et al. (2022) consider three separate uses of the decennial census data: (1) oversampling populations in surveys, (2) screening operations for surveys of rare populations, and (3) allocating federal funds to specific areas. They find that for use cases that involve large populations, the effects of noise injection are relatively modest, but for rare populations and small areas, sampling-frame coverage problems and misallocations of funds can be severe.
In "Private Numbers in Public Policy: Census, Differential Privacy, and Redistricting" (2022), the authors (Aloni Cohen, Moon Duchin, JN Matthews, and Bhushan Suwal) consider applications of the decennial census data in redistricting, where the data is used to balance districts, to describe the demographic composition of districts, and to detect signals of racial polarization. Based on a close look at nine localities in Texas and Arizona, they find reassuring evidence that TopDown did not threaten these redistricting functions, relative to legal and practical standards already in play. They also compare the discrepancies introduced by TopDown to previously documented sources of error in census data.

Commentary and Critique
In his reflections, "What Will It Take to Get to Acceptable Privacy-Accuracy Combinations?" (2022), Heffetz expands on his conference discussion, highlighting the multifaceted nature of the privacy-accuracy trade-off and the public discourse needed around its social and ethical interpretations.

In "Transparent Privacy Is Principled Privacy" (2022), Ruobin Gong, a statistician at Rutgers University, discusses an important advantage brought forth by differential privacy: the transparency of the privacy mechanism. The probabilistic mechanism with which data is privatized can be made public without harming the privacy guarantee, and can be leveraged using statistical methodologies to obtain trustworthy inference from privatized data. Gong further argues that mandated invariants imposed on the privatized data may diminish transparency and result in limited statistical usability, and calls for the release of the pre-'postprocessed' census noisy measurements to support social and scientific research.

In their article, Oberski and Kreuter (2020) argue that "the discussion on implementing differential privacy has been clouded by incompatible subjective beliefs about risk, each perspective having merit for different data types." They then study both challenges and positive consequences for social science research if differential privacy is widely implemented, and conclude with a call for interdisciplinary collaboration to solve the urgent puzzle that differential privacy raises.

Hotz and Salvo (2022) describe how feedback from the CNSTAT workshops informed improvements in the public data release. At the same time, the authors raise a number of issues that have become clearer since the CNSTAT Workshop. These topics concern the differences between the theoretical underpinnings of the Bureau's DAS and the consequences of its application in practice. In light of the latter findings, Hotz and Salvo highlight the need for the Census Bureau to conduct a new round of simulated reconstruction and reidentification attacks on the 2020 Census products that it plans to release.
They also call for a reexamination of the Census Bureau's legal obligation to protect the confidentiality of its respondents in light of the modern data technology landscape.
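Gong's transparency argument can be made concrete with a short sketch. Because a differentially private mechanism can be published without weakening its guarantee, an analyst who knows a count was released with Laplace noise at a given epsilon can quantify the extra uncertainty exactly. The helper below is illustrative only: the name `laplace_interval` and the assumption of a pure Laplace release with no post-processing are ours, not the Bureau's.

```python
import math


def laplace_interval(noisy_count, epsilon, alpha=0.05):
    """Interval for a true count given an epsilon-DP Laplace release.

    The noise is known to be Laplace(0, 1/epsilon), for which
    P(|noise| <= t) = 1 - exp(-t * epsilon). Solving for t gives
    t = ln(1/alpha) / epsilon, covering the noise with prob 1 - alpha.
    """
    half_width = math.log(1.0 / alpha) / epsilon
    return (noisy_count - half_width, noisy_count + half_width)


# A count released at epsilon = 0.5 carries noise whose standard
# deviation is sqrt(2)/epsilon; a 95% interval is roughly +/- 6.
low, high = laplace_interval(1000.0, epsilon=0.5)
```

For the Census DAS itself, post-processing steps (such as enforcing non-negativity and the mandated invariants) complicate this simple accounting, which is precisely why Gong calls for releasing the pre-postprocessed noisy measurements.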

Broader Perspectives
In "Disclosure Avoidance and the 2020 Census: What Do Researchers Need to Know?" (2022), Erica L. Groshen and Daniel L. Goroff address all three questions summarized in this foreword, as they provide essential information to social science researchers who will analyze the privacy-protected 2020 Decennial Census data. They explain what is new about the 2020 Census's approach to disclosure avoidance, as well as what seems new but is actually little changed from recent censuses. They also examine strategies, trade-offs, and rationales associated with processing and releasing the decennial results. Finally, they offer specific recommendations to the Census Bureau and researchers to help promote appropriate and well-informed analysis of 2020 Census data.

Bottom Line
Of necessity, this special issue is an interim report. From the beginning, given the complexity of the transformation and the duration of the implementation, we did not expect to provide the final word on the use of differential privacy in the 2020 Census. Rather, we seek to help assemble a foundation that future research and policymaking can build on, both within and beyond the context of disclosure avoidance for census data. In that spirit, we offer some high-level lessons learned thus far.
1. Lest anyone think otherwise, modernizing disclosure avoidance protections for 2020 Census products is necessary and has far-reaching implications. This change will affect many stakeholders (including the public, the Census Bureau, researchers, and policymakers) in many ways. The change is not that disclosure avoidance distortions and other types of error are new for decennial census products. The change is that stakeholders are directly confronted by the accuracy-privacy trade-offs (rather than these being managed by the bureau behind the scenes) and must develop related norms and practices for data analysis and communication.
2. The transition has just begun. Using the new data appropriately will require analysts to rethink methods and sources used to analyze decennial censuses. Their experiences and requests will provide valuable feedback to the Census Bureau. The Census Bureau's plans for products and access will necessarily continue to adjust over the next few years, if not longer. Furthermore, even once 2020 products and processes are settled, new ones will need to be devised as part of advance planning for the 2030 Decennial Census, which is already underway.
3. The complexities of implementation, not to mention delays due to the COVID-19 pandemic, have required the Census Bureau to make many midstream adjustments. While privacy protections and noise injection are not new, this is the first instance of a statistical agency applying formal privacy protections to a population census. The Census Bureau has had to negotiate a variety of unexpected technical, legal, organizational, personnel, funding, and political challenges throughout the implementation process. As a consequence, preparing for and keeping abreast of plans, progress, and implications has been difficult for observers.
4. Many stakeholders have a lot of work ahead. For example, analysts who rely directly and indirectly on decennial census products to make statistical inferences may need to reexamine their choices of data sources and methodologies and adjust them appropriately. In another realm, policymakers likely should reconsider the appropriateness of existing triggers, such as knife-edge program qualification criteria (for example, Housing and Urban Development Community Block Grants, Rural Business Development Grants, and the Rural Microentrepreneur Assistance Program), that will now rely on intentionally 'fuzzed' data.
Furthermore, technical data users need to build more holistic frameworks for measuring various forms of uncertainty. Noise injected through the DAS interacts with other known and unknown sources of error in the data.
Little is understood about their interactions, and much more research and evaluation are needed.
5. Future privacy protection efforts for official data will reflect lessons learned from this effort. This experience will help the Census Bureau and the other statistical agencies as they decide if and how to extend formal privacy protections to other data series in the years to come. Yet, observers should avoid leaping too quickly to conclusions about likely impact on any particular data set, because the exact implementation of differential privacy protections must be specific to the nature of the data involved. Furthermore, litigation or legislation on some of the topics raised here may be inevitable and could also affect future implementation.
6. Communication across perspectives and disciplines is essential, challenging, and needs to start early. For some stakeholders, this change is long overdue, even as others feel blindsided and question the science and the urgency. More communication among the various communities could help convert these gaping cross-disciplinary disconnects into opportunities for fruitful collaboration. Indeed, these exchanges should be underway well before the changes impact people's work.
We hope that readers will find this special issue useful as they prepare for the changes ahead and orient themselves in the constellation of stakeholders involved in the implementation of formal privacy for the 2020 Decennial Census. Given the political implications of the data, many public exchanges about the 2020 Decennial Census data have been highly contentious. By contrast, we are heartened by the constructive discourse represented in this special issue, where the authors are driven by a shared goal of finding the best way forward.