
A Chronicle of the Application of Differential Privacy to the 2020 Census

Published on Jun 24, 2022

Abstract

In this article, we chronicle the U.S. Census Bureau’s development of the Disclosure Avoidance System (DAS) for the publicly released products of the 2020 Census of Population. We provide a brief history of the Census Bureau’s fulfillment of its dual mission of conducting and disseminating the constitutionally mandated decennial information on the U.S. population and its promise of safeguarding the confidentiality of that information. We discuss the basis for and development of a new DAS for released data products from the 2020 Census and the evidence that emerged from various user communities on the accuracy and usability of data produced under this new DAS. We offer some assessments of this experience, the dilemmas and challenges that the Census Bureau faces for producing usable data while safeguarding the confidentiality of the information it collects, and some recommendations for addressing these challenges in the future.

Keywords. 2020 Census, privacy and confidentiality, differential privacy, disclosure avoidance systems


1. Introduction: The Bureau’s Mission and the Promise of Confidentiality

The decennial census is a pillar of the nation’s statistical system because a complete count of the population is the foundation for so many activities that define who we are as a country. As per Article 1, Section 2 of the U.S. Constitution, it is the basis for determining apportionment of seats in the House of Representatives. Census data are generally used to draw electoral districts at federal, state, and local levels, and—directly or indirectly—are the basis for distributing more than $1.5 trillion from the federal government to states and localities annually (Reamer, 2019). From a statistical standpoint, the decennial census defines the universe for dozens of surveys, which serve as the basis for estimating a large array of demographic, social, and economic characteristics of the population throughout the decade. The decennial census serves as the base for population estimates, for the selection of samples for countless surveys, and for the weighting and adjustment of those samples to form estimates of the population.

Thus, the Census Bureau’s mission to “serve as the nation’s leading provider of quality data about its people and economy”1 puts it at the center of the nation’s statistical system. Its capacity to execute its mission, however, relies on a pact with the public: in exchange for providing information, the Census Bureau ensures that individual responses remain confidential. Any breach of this trust puts response at risk and potentially undermines the statistical health of the nation. The Census Bureau’s own research reveals that this trust has been fraying, leading to increasing concerns about privacy (McGeeney et al., 2019). That is why threats to census data confidentiality are taken so seriously and why debates over what it will take to maintain that pact with respondents are so important.

The current debate over how best to protect the identity of individual respondents is shaped by the Census Bureau’s interpretation of Section 9 of Title 13 of the U.S. Code, which was enacted in 1954. Thus, officials need to ensure that, as they process data for public release, the identity of any individual respondent is protected and cannot be inferred from a publicly released data file. They have been doing so successfully since the ban on such releases was put in place for economic data in 1909 and for population data in 1929. The proliferation of large, publicly available data sets and widespread advances in computer technology, however, have prompted users and producers of tabular information to evaluate whether their tried-and-true methods for processing data and protecting against disclosure need revision. The Census Bureau adopted new formal privacy methods for the processing of the 2020 Census, but the methods have triggered an intense debate in the data user community about how such efforts, while laudable in their intent, may be compromising the content and accuracy of the very data that are at the core of the Census Bureau’s mission. It is this debate that this article seeks to chronicle and inform.

In an effort to provide context for the current debate, Section 2 provides a short history of how the very meaning of confidentiality has evolved, from the early days of the Census Bureau as a permanent statistical agency, through the adoption of Title 13, and up to the present day. The focus is on how the Census Bureau has attempted to meet the challenges posed by the increased demand for data and the trust that is the basis for its collection. Section 3 provides a substantive chronology of how that debate has evolved over the past two decades, with the widespread availability of massive databases, increases in the technology available to mine data, and the increased attention to privacy. The focus is on how the Census Bureau’s newest methods work—in theory and in practice—to protect the privacy of individual respondents, and the challenges these methods present. Finally, Section 4 identifies the dilemmas posed by the current interpretation of Title 13, as the Census Bureau attempts to fulfill its mission of providing quality data for the nation. Included here are a series of recommendations that would help the Census Bureau to move forward on its dual mandate of providing accurate data while maintaining the privacy of respondents.

2. Where We Were: Building on the Past

2.1. Disclosure Issues in the 20th Century

From its earliest days as a permanent agency of the federal government in the first decade of the 20th century, the Census Bureau understood the importance of trust between the agency and the public that supplied the data it requested. President William Howard Taft’s proclamation of March 15, 1910, which was meant to motivate participation in the 1910 Census, reassured the public that no harm would come to those who responded: “There need be no fear that any disclosure will be made regarding any individual person or his affairs.”2

While the Taft Proclamation stressed the privacy of individual census responses, the enabling legislation for the 1910 Census, passed in 1909, actually focused on the privacy of business/manufacturer establishments, especially given the antitrust investigations of that era.3 Indeed, Section 32 of the enabling legislation of 1909 permitted the Director of the Census, “at his discretion,” to “furnish such governor or court of record with certified copies of so much of the population or agricultural returns as may be requested.” A case in point occurred in 1917, as the nation prepared to enter the First World War. Concerns about the draft resulted in the release of individual census records, as a means of assessing the completeness of conscription efforts. The disclosure of individual census records, other federal agencies argued, must be conditioned on the needs of the nation. Thus, the Bureau director turned over records to those charged with implementing the draft.

As the 1920 Census approached, the 1919 enabling legislation repeated the confidentiality promise of Taft’s 1910 Proclamation, but still left the door open for the director to exercise discretion in the release of data to state and local officials and others for a variety of purposes, provided that no harm was done to individual respondents.4 A flurry of requests for individual census records followed the 1920 Census, which resulted in a steady erosion of confidence that such information would be kept confidential. Cognizant of the impact that this fraying of trust would have on its data collection activities—for population as well as economic data—the Census Bureau took steps to firm up the promise of confidentiality before the next decennial census, in the form of the Census Act of 1929. The 1929 Census Act helped restore trust in the confidentiality of census responses, with a strict prohibition on the publication of records, with a special emphasis on the population census that was about to take place. Despite this, the director of the Census Bureau still had considerable discretion regarding the release of data, as long as such a release would not be to the detriment of respondents.5 Moreover, the idea that release would be contingent on matters of national importance was still part of the debates in that era.

The concern over national imperatives surfaced again with the passage of the 1942 War Powers Act. This legislation authorized the Secretary of Commerce to provide records to any agency of the government “for use in connection with the conduct of the war’’ (Second War Powers Act, 1942). This resulted in the release of census records for Japanese American internment after the 1941 Pearl Harbor attack. It was clear that national priorities, however defined, could upend the promise of confidentiality, which in turn threatened the trust that served as the basis for a good census enumeration.6

While the expiration of the War Powers Act in 1947 seemed to put the Bureau back on track in the confidentiality arena, public distrust was a major concern and a threat to response in the next census enumeration. The idea that confidentiality would be conditioned on national priorities continued to haunt the Bureau, in the form of requests for census records, which continued unabated after the Second World War. This was especially true for agencies at the federal level, such as the Department of Justice, the Federal Bureau of Investigation, and the Secret Service, all arguing that release of individual records was in the national interest.7

The Census Bureau pushed back hard after the war, but the battles were becoming ever more complex and visible, with threats to the confidentiality promise growing with each passing year. It was becoming obvious that confidentiality guarantees to census respondents needed to be strengthened. This led to the enactment of Title 13 in 1954, stressing that data in the census were to be used for statistical purposes only (using similar language to the 1929 Census Act). Section 8 stated that: “In no case shall information furnished under the authority of this section be used to the detriment of the persons to whom such information relates” (Public Law 740, Chapter 1158, p. 1013). Further, in Section 9:

Neither the Secretary, nor any other officer or employee of the Department of Commerce or bureau or agency thereof, may, except as provided in section 8 of this title— (1) use the information furnished under the provisions of this title for any purpose other than the statistical purposes for which it is supplied; or (2) make any publication whereby the data furnished by any particular establishment or individual under this title can be identified; or (3) permit anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports. (Public Law 740, Chapter 1158, pp. 1013–14.)

The ultimate responsibility for disclosure issues now resided with the Secretary of Commerce, and housing data was now explicitly included in this legislation.8 Since the enactment of Title 13, the Census Bureau has not allowed a single unauthorized release of individual census records (although census records can be provided to individual respondents themselves with their permission).

2.2. A Changing Data Environment

The conduct and content of the decennial census do not occur in a vacuum. When laws were enacted around the time the Census Bureau became a permanent agency of the federal government, the state of technology and the nascent organized statistical system were very different from those that had evolved by midcentury. Disclosure issues and controversies during the first half of the 20th century were all about the release of individual census records—think of rows of filing cabinets with paper census returns and mechanical punches used to compile these records for tabulation—but something else was changing in those early decades.

In the early decades of the Census Bureau’s existence as a permanent agency, the requests for more detailed data were increasing. Locally sponsored Census Bureau tabulation efforts occurred on the heels of the increased immigration from southern and eastern Europe. The many immigrants flocking to large cities gave rise to an increased demand for small area data to solve problems related to public health, the identification of new religious and political constituencies, and the business of selling newspapers (Salvo, 2012). With each successive decade, more and more detailed census data were released as a result of local efforts in cooperation with the Census Bureau until, in 1940, the bureau first delineated census tract data for major cities in the nation. The proliferation of detailed small area data combined with advances in computer technology were about to change the disclosure terrain forever.

Societal changes become incorporated into each census, such as a new question about household ownership of a radio in 1930, the expansion of content possible through the adoption of sampling in 1940, and widespread use of mailout/mailback methods in 1970. So too, changes in the census environment affect the potential for disclosure of individual information. Just as the large-scale adoption of mechanical machine tabulation methods early in the 20th century enabled efficient creation of the first detailed small area data tabulations, the adoption of computers in the 1960s made the development of intricate cross tabulations for small geographic areas easier to create and more accessible. These changes ushered in a new era that has been characterized by an almost insatiable demand for data to inform decisions at all levels of government and in the private and not-for-profit sectors of the nation.

So, while the general nature of the confidentiality pact has remained consistent since the passage of Title 13 in 1954, the actions to protect the data from disclosure have had to adjust to new environments, where individual disclosure is not solely a function of turning over census records for individuals or businesses, but where disclosure occurs based on deriving individual information from increasingly complex cross tabulations for small geographic areas.

2.3. Computers and Disclosure: The 1970s, 1980s, and 1990s

The computer era, which began in midcentury, took hold for most data users in the 1970s, when individual census records, or microdata, were compiled and analyzed using the new power of computers, greatly increasing both the volume and complexity of tabulations that were made available. This new environment, in turn, created new issues for those who were tasked with preventing individual disclosure of census records; not the kind where individual census records were provided to another agency, but where such records could be derived with sufficient detail to identify individuals or small groups in detailed cross tabulations.

While different forms of suppression had been used prior to the age of electronic computers, usually through the use of thresholds for revealing data, it was the proliferation of both detailed cross tabulations, termed meso-data by Anderson and Seltzer (2007), and computer technology that ushered in this new era. Acknowledging the importance of this issue, the Subcommittee on Disclosure Limitation Methodology, formed by the Federal Committee on Statistical Methodology, in 1994 stated that:

The release of statistical data inevitably reveals some information about individual data subjects. Disclosure occurs when information that is meant to be treated as confidential is revealed. Sometimes disclosure can occur based on the released data alone; sometimes disclosure results from combination of the released data with publicly available information; and sometimes disclosure is possible only through combination of the released data with detailed external data sources that may or may not be available to the general public. At a minimum, each statistical agency must assure that the risk of disclosure from the released data alone is very low. (Office of Management and Budget, 1994, Chapter 1, p. 2)

Recognizing that computing technology and increasing content for small geographic areas represented a threat to the promise of confidentiality, the Census Bureau adopted techniques intended to suppress data to prevent disclosure of individual information. This took the form of cell and table suppression in 1970, continuing with the 1980 Census products, where table cells or whole tables were suppressed when they did not meet specific thresholds of population or housing units (i.e., primary suppression) (Gates, 2012).

More controversial was the suppression of data that could be derived through subtraction (i.e., complementary suppression) (Sullivan, 1992). In many instances, tables that were heavily populated but where distributions were severely skewed resulted in the loss of data for an entire table. For example, a table for owner and renter housing units (housing ‘tenure’) would be wholly suppressed unless there were a certain minimum number of housing units in both categories to prevent subtraction of small table cells—a problem in parts of large cities where everyone rents and virtually no one owns apartments, or in places where everyone owns their own homes. Suffice it to say that by the late 1970s, the outcry from users about the loss of data, inability to correctly aggregate tables, and the inconsistencies across tables and groups associated with a system solely based on suppression caused the Bureau to rethink its approach (Zeisset, 1978).

In the 1980s, pressure mounted for alternatives that would blunt the impact of data suppression, as “Users found that the suppression techniques in the 1980 reports limited their ability to use the data” (U.S. Census Bureau, 1994, p. 10–3). For 1990, the suppression rules used in 1970 and 1980 were changed to address some serious shortcomings of the primary and complementary suppression system. It is instructive to list the three main reasons why the Census Bureau modified the 1980s system for the 1990 Census, two of which focused on the needs of data users (McKenna, 2018, p. 6):

  1. Dissatisfaction with the reduction in data tables caused by whole table suppression

  2. The lack of guidance for data users using the published data in the presence of suppression

  3. The disclosure risk issues caused by the lack of complementary suppression across geographic areas

Termed a ‘confidentiality edit,’ the 1990 system replaced suppression with rules-based data swapping, where data were exchanged at the individual record level for the 100% data. For sample data, a ‘blank and impute’ method was employed, where an estimate for a housing unit in a sample block was removed and replaced with an imputed value.

After selecting a small number of census households from internal Census Bureau files, data from these households were ‘swapped’ with households that had the same characteristics on certain key variables but were from different geographic locations. The key variables included number of people in the households by race/Hispanic origin by age group (under 18 and 18 and over), number of units in the building, rent/value, and tenure (owner/renter). The use of these variables to control the swapping of households ensured that total population and the population by key characteristics were not altered for small geographic areas. Census microdata were subjected to swapping prior to tabulation and information on which households were swapped and the degree of swapping were not publicly available (U.S. Census Bureau, 2003).

2.4. The 2000s: Progress on the Confidentiality Front?

The 1990s brought with them efforts to change the way respondents reported race. Extensive consultation with representatives of federal agencies and the data user community resulted in the adoption of revised race categories for federal surveys, based on the 1997 Office of Management and Budget (OMB) revision of Statistical Policy Directive No. 15 on classifying race and ethnicity. For the first time, respondents on federal surveys would be permitted to “check more than one race.” This created 63 possible combinations of race categories, thus increasing the size of data files and the number of data cells with very small counts, further increasing the likelihood of disclosing individual responses, especially for small geographic units. The bureau used a modified approach to data swapping in the 2000 Census, focusing on cases in blocks with the highest risk of disclosure and in blocks with low imputation rates for full count tables (U.S. Census Bureau, 2003, p. 30).

During the late 1990s and around the period of the 2000 Census, the Census Bureau conducted a number of survey-based research studies regarding attitudes about the confidentiality of data and trust in the census. They found heightened concerns about the collection of data by the government and the private sector. And, the new millennium brought with it more sophisticated data technology, resulting in increases in apprehension about the collection of data. The heightened concern on the part of the public was increasingly manifested in across-the-board declines in survey response rates, resulting in increased costs and threats to data quality (Czajka & Beyler, 2016; National Research Council, 2013). The bureau took steps to increase response to the census, using a major paid advertising campaign for the first time in the 2000 Census. In addition, the bureau reinforced messages about respondent confidentiality in the form of a set of principles to supplement Title 13 protections and further ensure that collected data remain confidential (U.S. Census Bureau, 2003).

Two events in the first decade of the new millennium further increased the spotlight on issues surrounding confidentiality of federal data and on privacy. The first had to do with the continued reevaluation of the Census Bureau’s role in the internment of Japanese-Americans in the early 1940s. In 2000, Kenneth Prewitt, the director of the Census Bureau, acknowledged the role of the Census Bureau in advancing the cause of Japanese internment, and joined President Bill Clinton in apologizing for this sad event in American history (U.S. Census Bureau, 2003). However, it was not until 2007, upon the revelation of further evidence by Anderson and Seltzer, that the Bureau formally acknowledged the proactive nature of its role related to the release of microdata for individuals of Japanese descent (Anderson & Seltzer, 2007).

The second occurred in the wake of the September 11, 2001 attacks, with requests from federal agencies for data on Arab Americans (El-Badry & Swanson, 2007). While ostensibly not a violation of the bureau’s mission, the provision of small area aggregate data for Americans of Arab descent raised the specter of government surveillance, not unlike that which occurred in 1942 with Japanese Americans. Thus, while the Bureau’s leadership saw the act of providing what were essentially publicly available data for a specific group as being consistent with its mission, the controversy that ensued caused the bureau to reevaluate its mission to provide data for the nation while protecting the confidentiality of respondents (Anderson, 2020). Once again, trust in the Census Bureau was at risk, the very trust that serves as the basis for response to the census and the myriad surveys it conducts.

3. The Post-2010 Environment and a Change in Course for the 2020 Census

The 2010 Census utilized data swapping as the backbone of its Disclosure Avoidance System (DAS). While swapping was thought to provide some confidence that individual responses could not be identified with certainty, advances in computer science, better record linkage technology, and the proliferation of large public data sets have increased the risk of disclosing information about individuals in the census and have caused concern about the adequacy of the swapping approach. It is hard to put an exact timestamp on when the Census Bureau began to debate the merits of new disclosure prevention methods for the 2020 Census, but it was in September of 2016 that John Abowd, the bureau’s chief scientist, presented a case for a new approach to protecting the privacy of respondents to the Census Scientific Advisory Committee (CSAC) (Abowd, 2016). Discussions with CSAC continued in 2017 with an update provided by Simson Garfinkel, Senior Computer Scientist for Confidentiality and Data Access at the U.S. Census Bureau. At this point, the presentation was far more direct, indicating that the plan was to implement disclosure avoidance based on differential privacy for the 2018 Census Test and, ultimately, for the 2020 Census (Garfinkel, 2017).

It was not until 2018, however, that the Census Bureau conducted a simulated ‘attack,’ reconstructing person-level records from published 2010 Census tabulations using its previous DAS, which was based on swapping data between blocks for households that had similar characteristics. When combined with information in commercial and publicly available databases, these reconstructed data suggested that 17% of the U.S. population could be identified with a high level of certainty using published Census Bureau tabulations.9 The Census Bureau concluded that, if adopted for 2020, the 2010 confidentiality measures would lead to a high risk of disclosing individual responses, violating Title 13 of the U.S. Code. The bureau was also concerned about the implications the new DAS would have for census content—not related to the data for apportionment or redistricting, which was largely mandated—but for the whole array of detailed tabulations published from the 2010 Census and anticipated to be published from 2020. In an effort to learn more about the content that would be required for 2020, the bureau issued a Federal Register Notice in July of 2018, “Soliciting Feedback from Users on 2020 Census Data Products.” In that Notice, they stated that:

Given the need for improved confidentiality protection, we may reduce the amount of detailed data that we release to the public. Public feedback is essential for a complete review of the decennial census data products [which] will assist the Census Bureau in prioritizing products for the 2020 Census. (Soliciting Feedback, 2018)

The request by the bureau was for users to provide information on the need for every single table originally proposed for the 2020 Census. Users were asked about the legal, statutory, and programmatic uses of each data item, along with a request regarding the amount of funding that was distributed based on the data and the level of geography required for the items. Absent from this request was any real explanation of the reason for this exercise, outside of a general statement about improving confidentiality protection. This request engendered a sense of bewilderment on the part of data users and triggered a litany of concerns about 2020 Census content that was clearly at risk.

In December of 2018, the Census Bureau released Version 4.0 of its 2020 Census Operational Plan that now included the new disclosure avoidance method, which was still under development and was based on the concept of differential privacy (DP) (U.S. Census Bureau, 2018, pp. 139–140).

In February 2019, the then-deputy director of the U.S. Census Bureau, Ron Jarmin, publicly announced that the bureau would use algorithms that satisfied the DP criterion (defined in Section 3.1 below) as the basis of its DAS for the 2020 Census, in order “to preserve the confidentiality of the 2020 Census” public release data products.10 In that same month, at a meeting of the American Association for the Advancement of Science (AAAS) in Washington, D.C., Abowd, as chief scientist at the Bureau and principal architect of the new DAS, made a presentation, before an audience that included members of the press, in which the case was made for the new system based on differential privacy, as part of a data disclosure panel.11

3.1. The Theoretical Foundations of Differential Privacy

The Bureau’s new DAS is based on a set of probabilistic algorithms that produces releasable statistics from confidential data that satisfy a predetermined bound known as the differential privacy criterion. More formally, let X denote a confidential data set for n individuals and let s(X) denote a statistic calculated with the confidential data. To protect against the disclosure of information about an individual i who is included in the confidential data, X, randomized versions of s(X), produced by a mechanism M(X), are released to the public rather than s(X) itself. The mechanisms used are designed to meet the DP criterion. A mechanism, M, is said to be ‘ε-differentially private’ if it satisfies the following bound,

$$\frac{\Pr\left( M(X) \in B \right)}{\Pr\left( M(X') \in B \right)} \leq e^{\varepsilon} \quad\quad\quad (1)$$

for any confidential data set, X′, that differs from X by exactly one observation, and where ε > 0. A commonly used mechanism, M, that is ε-differentially private adds ‘random noise’ to s(X) to produce M(X), that is,

$$M(X) = s(X) + U \quad\quad\quad (2)$$

where U is drawn independently from a Laplace distribution that has mean zero and a variance that is proportional to the inverse of ε.12 While other mechanisms exist that are also ε-differentially private, we focus on the one in (2) in this article to fix ideas.

In words, the ε-differentially private mechanism in (2) that satisfies (1) produces released statistics, M(X) for the confidential data set X and M(X′) for the data set X′, whose probabilities differ by a relative factor of no more than e^ε. Data providers, such as the Census Bureau, set ε. Setting ε to larger values when generating the statistics M(X) and M(X′) makes it ‘relatively easier’ to distinguish individual i from the released statistics, since they are ‘closer’ to the true statistics s(X) and s(X′); that is, ‘less noise’ is added to them because the U’s used to form the released statistics are drawn from a distribution with a lower variance. In contrast, setting ε to smaller values infuses ‘more noise’ into the released statistics, making them less similar to the confidential statistics, s(X) and s(X′), and thereby making it harder to identify individual i from the released statistics. But smaller values of ε imply that the released statistics, M(X), will be less accurate, that is, more likely to differ from s(X) or s(X′), respectively. As a result, there is a fundamental tradeoff between privacy protection and accuracy in using a differentially private mechanism for the DAS. However, in contrast to other disclosure avoidance approaches, mechanisms that satisfy the DP criterion allow data providers to control this tradeoff, in principle, by their choice of ε, that is, their choice of a bound, or budget, for the allowed privacy loss defined in (1).
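To make the mechanism in (2) and the role of ε concrete, the following is a minimal sketch in Python. It assumes a simple count statistic s(X) with sensitivity 1 (adding or removing one person changes the count by at most one), so the Laplace noise scale is 1/ε; the function name and the example count are illustrative and are not drawn from the Census Bureau’s implementation.

```python
import numpy as np

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return an epsilon-differentially private version of a count query.

    Assumes a statistic with the given sensitivity, so the Laplace
    noise scale is sensitivity / epsilon (less noise as epsilon grows).
    """
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

rng = np.random.default_rng(seed=2020)
true_count = 137  # hypothetical block-level population count

# Larger epsilon -> less noise -> more accurate but weaker protection.
for eps in (0.1, 1.0, 10.0):
    draws = [laplace_mechanism(true_count, eps, rng=rng) for _ in range(5)]
    print(f"epsilon={eps:>4}: " + ", ".join(f"{d:7.1f}" for d in draws))
```

Running the loop shows released counts that can wander tens of persons from the true value at ε = 0.1 but stay within a person or two at ε = 10, which is the privacy versus accuracy tradeoff described above.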

The DP criterion and the associated mechanisms defined in (2) were first proposed by Dwork et al. (2006) and grew out of earlier work by computer scientists working in the area of cryptography. Furthermore, this approach is transparent in the sense that data providers can disclose the exact algorithms used in a DAS. Finally, this approach is not premised on particular assumptions about what data potential intruders, seeking to identify individuals and their private data, might possess beyond the data that are released to the public. For all of these reasons, adopting a DAS based on the DP criterion was an attractive approach for protecting the data to be released from the 2020 Census, especially in light of the concerns raised by the reconstruction and reidentification attacks that the Census Bureau simulated on the publicly available data from the 2010 Census. However, as we discuss in Section 3.6, mechanisms satisfying the DP criterion place bounds on a particular form of disclosure risk that may or may not be consistent with the requirements of Title 13 for protecting the privacy and confidentiality of data released by the Census Bureau.

In configuring the DAS for the 2020 Census, the key policy question is ‘where to set the dial,’ that is, how high to set ε. Just as important as the overall level of ε is its allocation over the content and detail of the census tabulations, the distribution of which is what is called the privacy loss budget (PLB) (see Dwork and Roth, 2014; Groshen & Goroff, 2022; Vadhan, 2017; Wood et al., 2018). This question of how much privacy loss is acceptable has fueled a spirited debate about how the Census Bureau can best fulfill its mission of providing quality data for the nation, while protecting the confidentiality of census responses that is the basis for trust that undergirds its data collection activities.
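Because privacy loss accumulates across releases, the overall budget is typically split among tables and geographic levels: under basic sequential composition, releases made with individual budgets ε_i together satisfy differential privacy with ε equal to the sum of the ε_i. The bookkeeping sketch below uses hypothetical shares chosen only for illustration; they are not the Bureau’s actual allocation.

```python
# Illustrative privacy-loss-budget bookkeeping under sequential composition.
# The total budget and the shares below are hypothetical, not the Census
# Bureau's actual values.
total_epsilon = 2.0

allocation_shares = {
    "state-level tables": 0.10,
    "county-level tables": 0.25,
    "tract-level tables": 0.30,
    "block-level tables": 0.35,
}

for product, share in allocation_shares.items():
    eps_i = total_epsilon * share
    # A smaller eps_i means proportionally more Laplace noise for that product.
    print(f"{product:20s} epsilon_i = {eps_i:.2f}  (noise scale ~ {1 / eps_i:.2f})")

# The shares exhaust, and do not exceed, the overall budget.
assert abs(sum(allocation_shares.values()) - 1.0) < 1e-9
```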

Having decided to develop a DP-based DAS for the data to be released from the 2020 Census, the Bureau began developing what it referred to as the TopDown Algorithm (TDA) (Abowd et al., 2022) to “preserve the utility of our [the Bureau’s] legally mandated data products while also ensuring that every respondent’s personal information is fully protected.” The TDA implemented a particular differentially private mechanism for constructing privacy-protected tabulations but also constrained these tabulations to preserve certain counts, such as requiring that state- and national-level counts of individuals equal the population totals determined by the census. Furthermore, the Bureau embarked on a 2-year engagement with the scientific community and the census data user community to optimize and assess the adequacy of this algorithm for the range of ‘use cases’ to which decennial census data are devoted. In June of 2019, a Committee on National Statistics (CNSTAT) workshop was held in Washington, DC, to unveil the approach, but it was not until October of 2019, when the Census Bureau issued its first ‘demonstration file,’ that this engagement with user communities began. We summarize that process of engagement in the next section.
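Before turning to that engagement, a toy numerical sketch may help fix ideas about the constrain-and-noise structure just described. The sketch below is written under simplifying assumptions (one ‘state’ with four ‘counties,’ Laplace noise with sensitivity 1, a made-up ε): it holds the state total invariant and post-processes the noisy county counts to be nonnegative and to sum to that total. It is an illustration of why post-processing can introduce bias, not the Bureau’s TDA.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def noisy(count, eps):
    """Laplace noise for a count query with sensitivity 1."""
    return count + rng.laplace(scale=1.0 / eps)

# Hypothetical confidential counts: one 'state' split into four 'counties'.
county_true = np.array([1200.0, 85.0, 40.0, 3.0])
state_total = county_true.sum()          # invariant, released exactly

eps = 0.5
county_noisy = np.array([noisy(c, eps) for c in county_true])

# Post-processing: clip negative values to zero, then rescale so the
# counties again sum to the invariant state total.  (A production system
# must also round to integer counts, which requires further adjustment.)
county_nonneg = np.clip(county_noisy, 0.0, None)
county_released = county_nonneg * state_total / county_nonneg.sum()

print("true:    ", county_true)
print("noisy:   ", np.round(county_noisy, 1))
print("released:", np.round(county_released, 1), " sum =", round(county_released.sum(), 1))
# Clipping negative noise away from the smallest counts pushes them upward
# on average, one source of the post-processing bias users later observed.
```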

3.2. Evaluating the DP-based DAS for the 2020 Census: 2010 Demonstration Products and Interactions With User Communities

To test how well the current DAS methodology works in terms of the accuracy of noise-infused data, the Census Bureau issued special 2010 Census files subject to the 2020 DAS. These ‘demonstration files’ applied the 2020 Census DAS to the 2010 Census confidential data, that is, the unprotected data from the 2010 Census that are not publicly available.13 The demonstration data permitted scientific inquiry into the impact of the new DAS, including comparisons of tabulations from the 2010 demonstration data and publicly released tabulations from the 2010 Census. In addition, the Census Bureau commissioned the CNSTAT of the National Academies of Sciences, Engineering, and Medicine to host a 2-day workshop, “2020 Census Data Products: Data Needs and Privacy Considerations,” in Washington, D.C., on December 11–12, 2019.14 The two-fold purpose of the workshop was to: 1) assess the utility of the tabulations in the 2010 Demonstration Product for specific use cases/real-life data applications and 2) generate constructive feedback for the Census Bureau that would be useful in setting the ultimate privacy loss budget and in allocating shares of that budget over the broad array of possible tables and geographic levels.

The workshop brought together a diverse group of researchers who presented findings for a wide range of use cases that relied on data from past censuses. These presentations, and the discussions surrounding them, provided an assessment of the potential consequences of the Census Bureau’s new DAS for a variety of uses. A summary of key observations:

(a.) Population counts for some geographic units and demographic characteristics were not adversely affected by differential privacy. Based on results presented at the workshop, it appeared that there were not, in general, meaningful differences in population counts between the 2010 demonstration file and the published 2010 data at some levels of geography, such as states (where counts were invariant), and for geographic areas with direct allocations of the privacy budget (most counties, metro areas, and census tracts).

(b.) Concerns with data for small geographic areas and population groups. At the same time, evidence presented at the workshop indicated that most data for small geographic areas—especially census blocks—are not usable given the privacy-loss level used to produce the demonstration file. With some exceptions, applications demonstrated that the variability of small area data (i.e., blocks, block groups, census tracts) compromised existing analyses.

(c.) The absence of a direct allocation of privacy-loss budget for political and administrative geographic areas, such as places and county subdivisions, or to detailed race groups, such as American Indians. Numerous presenters demonstrated how these places and groups are very important for many use cases, as they are the basis for political, legal, and administrative decision-making. Many of these cases involve small populations and local officials rely on the census as a key benchmark; in many cases, it defines who they are.

(d.) Problems for temporal consistency of population counts. Several presentations highlighted the problem of temporal inconsistency of counts, that is, inconsistency from one census to the next when DP is applied.

(e.) Unexpected issues with the postprocessing of the proposed DAS. The TopDown Algorithm employed by the Census Bureau in constructing the 2010 demonstration data produced histograms at different levels of geography that are, by design, unbiased—but they are not integers and include negative counts. The postprocessing required to produce a microdata file capable of generating tabulations of persons and housing units with nonnegative integer counts produced biases that are responsible for many anomalies observed in the tabulations.

(f.) Difficulties estimating error. The application of DP to raw census data (the Census Edited File [CEF]) produces estimates that can be used to model error, but the postprocessing adds a layer of complexity that may be very difficult to model, making the creation of ‘confidence intervals’ problematic.

(g.) The importance of protecting privacy in the face of challenging privacy concerns and their potential consequences for the success of the 2020 Census. There are growing disclosure risks associated with the ability to link information in public data sources, like the decennial census, with commercial databases containing information on bankruptcies and credit card debt, driver’s licenses, and federal, state, and local government databases on criminal offenses, public housing, and even citizenship status. While there are federal and state laws in place to prohibit the misuse of these governmental databases as well as the census (i.e., Title 13), their adequacy is challenged by advances in data linkage technologies and algorithms. And these potential disclosure risks may well undercut the willingness of members of various groups—including immigrants (whether citizens or not), individuals violating public housing codes, or those at risk of domestic violence—to participate in the 2020 Census.

Subsequent to the December 2019 CNSTAT Workshop, the response on the part of advocacy organizations was swift and to the point. The Census Bureau’s disclosure avoidance system, they claimed, introduced biases that led to inaccurate counts for minority communities (Asian Americans Advancing Justice-Mexican American Legal Defense and Educational Fund, 2020; National Congress of American Indians, 2020; Native Hawaiian-Serving Organizations, 2020). The Census Bureau responded that the DAS needed to be modified or ‘tuned’ to better optimize the balance between confidentiality and accuracy.

In order to provide the data user community with the ability to track these improvements, the bureau issued a set of proposed metrics in the spring of 2020—summary statistics that focused on accuracy, bias, and data outliers (Devine et al., 2020). This was seen by many in the data-user community as a necessary but insufficient proposal. State and local governments and a panoply of professional organizations all reacted to the proposal by requesting that the Census Bureau provide new demonstration data files, like those provided in October 2019 for the December CNSTAT workshop (Association of Public Data Users, 2020; Federal-State Cooperative Program for Population Estimates, 2020; National Conference of State Legislators, 2020; Population Association of America, 2020).

Not forgotten in this debate over the balance between confidentiality and accuracy is the trust that is engendered by the Bureau’s responsibility to abide by Title 13.15 This was reflected in communications by advocates with the Census Bureau about the importance of balancing the need for accuracy with the legal mandate to protect the confidentiality of respondents (Minnis et al., 2020). This point was also driven home in commentaries by census watchers, some of whom expressed concern about a failure of the Census Bureau to properly communicate with and educate the data user community about the reasons for new privacy protection approaches (boyd, 2020, pp. 27–31).

Throughout 2020 and into 2021, the Census Bureau narrowed its focus of ‘fitness for use,’ by ‘tuning’ the algorithm almost exclusively for redistricting applications using data from the PL 94-171 file, as defined by consultation with the U.S. Department of Justice for enforcement of the Voting Rights Act of 1965 (U.S. Department of Justice, 2021). Successive releases of demonstration files brought small incremental improvements in the data, mostly as a result of changes in the allocation of the privacy loss budget.

It was not until the spring of 2021, however, that major improvements in the DAS were seen, largely as a result of a big increase in the privacy-loss budget, with a goal of optimizing the data for use in redistricting. Using the 2010 Census data, the bureau announced the adoption of an ‘accuracy target’ that optimized the DAS for areas of 500 persons or more, which would provide estimates that were within 5 percentage points (of the 2010 published counts) 95% of the time for the largest race group.16 In addition, methods were improved for the allocation of the increased PLB to legal, administrative, and political geographic areas, which was a big problem with earlier demonstration files.

The final optimized DAS with an increased privacy loss budget was adopted by the Census Bureau’s Data Stewardship Executive Policy (DSEP) committee for the 2020 Census PL 94-171 redistricting file, which was first issued in August of 2021.

3.3. Assessments of 2010 Demonstration Data Products for Redistricting Purposes

While the final decision by DSEP brought big improvements to the quality of counts over earlier releases of the data, a debate ensued among researchers who study redistricting and a number of redistricting practitioners as to whether DP has an impact on the creation of districts aimed at representing minority voters.

For example, Kenny et al. (2021) used the April 2021 version of the 2010 Census demonstration data to examine both the properties of these data for purposes of redistricting and to simulate alternative redistricting maps that conform to commonly used criteria for legally acceptable redistricting practices. They find systematic biases in these data along racial and partisan lines that reduce the heterogeneity by geography in both of these dimensions. Based on their simulations, they find that using the DP-based DAS made it difficult to produce districts of equal population size, produced too few precincts that were heterogeneous along racial and partisan lines, underpredicted minority voters, and produced too few majority-minority districts (MMDs)17 that are used to judge whether a particular redistricting map conforms with the Voting Rights Act.

Furthermore, Kenny et al. (2021) examined the final version of the 2010 demonstration data released on August 12, 2021, that was produced using the version of the Census’s TDA that formed the basis of the DAS for the 2020 Census redistricting data, including its greater privacy loss budget. These authors found that while this final demonstration product was an improvement over previous releases, it still produced biases in drawing and simulation of voting districts (VTDs). They conclude that these biases “likely come from the decision to maintain accuracy at geographies other than VTDs and voting precincts, such as census tracts” (Kenny et al., 2021, p. 14).

Other assessments of the suitability of data produced by the bureau’s DP-based DAS for redistricting drew more favorable conclusions. Cohen et al. (2021) used a simplified version of the bureau’s TopDown Algorithm, which they labeled the “ToyDown” Algorithm, and applied it to block-level 2010 Census data for the state of Texas to assess how the application of this algorithm affects key issues for the production of voting districts, including: how different allocations of the privacy-loss budget across levels of geography affect district formation, how the construction of districts meeting criteria such as compactness is affected, and how (linear regression) methods used to detect racial polarization in voting rights cases are affected. Overall, they find that neither this simplified algorithm nor more complex versions appear to meaningfully distort their simulated redistricting exercises.

Such analyses have been and are being used to form the basis for legal challenges to the legality and legitimacy of these methods as applied to census data. One such suit, State of Alabama v. Department of Commerce, sought to prohibit the Census Bureau from delaying the release of the 2020 PL 94-171 data past the congressionally mandated date of March 31, 2021.18 In addition, the plaintiffs took issue with the Census Bureau’s interpretation of Title 13, namely, that it necessitates the use of differential privacy as the mechanism to protect the confidentiality of census responses. In addition to arguing that this contention is false, they posited that the application of differential privacy amounts to manipulation of the data used for redistricting purposes, a situation that would bring “significant harm to Alabama.”19 Finally, the plaintiffs argued that “the Bureau did not provide notice in the Federal Register of its decision to adopt differential privacy for the 2020 census. Nor did it otherwise seek public comment before the decision was made.”20

After the court agreed in March of 2021 to have the case decided by a three-judge panel, oral arguments were heard in May as to whether the court should issue a preliminary injunction against release of the data with differential privacy applied. The injunction was very time sensitive, according to the plaintiffs, because the harm caused by the release of redistricting data with differential privacy could not be undone, given that districting plans were already delayed because the Census Bureau could not meet its original March 2021 release date.

The U.S. District Court ultimately ruled against the plaintiffs in June of 2021 on the grounds that the case lacked “ripeness,” that is, the plaintiffs could not yet demonstrate harm. Moreover, by the plaintiffs’ own admission, the issue became moot when the PL 94-171 redistricting file was issued in August of 2021.

The issue of data disclosure has emerged in another lawsuit, brought by the Fair Lines America Foundation in May of 2021. While there were a number of requests made by the plaintiffs in this case, the ultimate focus was on a request made of the Census Bureau to provide state totals on the number of persons in group quarters who were imputed.21 The plaintiffs were trying to assess the impact of the pandemic on state population counts, given the difficulties encountered by the Census Bureau in the enumeration, especially in nursing homes and college dormitories. Release of such data, the Census Bureau argued, would constitute a violation of Title 13, since the plaintiffs were requesting internal data that would not be subject to differential privacy, adding to the number of released items that were invariant (i.e., not subject to disclosure avoidance algorithms).22 Experts for the plaintiffs, however, argued that such a release constitutes a negligible loss of privacy, because the data are at the state level. As of March 1, 2022, the case was still pending.

3.5. Assessments of the 2010 Census Demonstration Data for Other Use Cases

As issues involving data disclosure and reapportionment and redistricting work their way through the courts, the Census Bureau has been engaged in discussions with a number of state, local, and tribal governments, and advocacy organizations, focused primarily on assessing the accuracy of data produced with the TopDown Algorithm. Applications of the data for evaluating racial equity in voting, public health, emergency preparedness, and disaster mitigation, just to name a few, have become the subject of debates regarding the utility of the data that the Bureau produces.

One such line of research analyzed how differences between the published and the DP-protected population counts for different racial and ethnic groups, especially at finer levels of geography, would affect estimates of mortality rates. Census data have traditionally been used to form the ‘denominators’ in the calculation of group-specific mortality rates. Santos-Lozada et al. (2020) found sizable differences in these group-specific mortality rates for all but non-Hispanic Whites when using the 2010 Census public release data versus data from the early 2010 Census demonstration products. Similarly, Hauer and Santos-Lozada (2021) found that estimates of COVID-19 mortality rates also were distorted using data from the early 2010 Census demonstration products. In contrast, Krieger et al. (2021) assessed the differences in using the 2010 Census public release data and 2010 Census demonstration data for measures of inequities across census tracts in premature mortality (deaths before age 65) in the state of Massachusetts for the period 2008–2012 and found no evidence of across-tract differences in mortality using these two sources to calculate denominators.
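As a simple illustration of why the choice of denominator matters for this use case, the sketch below computes a hypothetical group-specific mortality rate per 100,000 population with a published count and with a perturbed count; all numbers are made up for illustration.

```python
def mortality_rate_per_100k(deaths, population):
    """Deaths per 100,000 population."""
    return 100_000 * deaths / population

deaths = 12              # hypothetical deaths in a small group
published_pop = 1_500    # hypothetical published denominator
perturbed_pop = 1_350    # hypothetical noise-perturbed denominator

print(mortality_rate_per_100k(deaths, published_pop))   # 800.0
print(mortality_rate_per_100k(deaths, perturbed_pop))   # about 888.9
```

A 10% shift in a small denominator moves the estimated rate by roughly the same proportion, which is why the studies above focus on small groups and small geographic areas.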

All of these studies used early versions of the 2010 demonstration data, where data were available by age groups, as well as smaller values of the privacy-loss budget than used in the final release of the data. The more recent releases of demonstration files have been based on the PL 94-171 redistricting data, where data were released for major race groups by just two categories of age: under 18 and 18 and over, including the more recent demonstration data that were based on larger privacy-loss budgets. Finally, none of the released demonstration data included information about household and family structure. Thus, it has not been possible to evaluate the impact of a large PLB on the utility of estimates for many critical applications that hinge upon the success of efforts to merge household and person data, such as information on families with own children and households with persons 60 years of age and over. The current debate is over the content of the Demographic and Housing Characteristics (DHC), which will include a much larger number of age categories for individuals, characteristics of their households, as well as the housing units in which they reside.23 The Census Bureau released a 2010 demonstration file with this more detailed content in April 2022,24 which the data user community is currently evaluating.

The content of the DHC, currently under consideration, is much more varied than the PL 94-171 files used for redistricting, and the applications are so broad as to preclude the development of any single target criterion for the DAS to satisfy. This will be especially difficult given the more elaborate and complex nature of many DHC tabulations, such as those with iterations by race and Hispanic origin for smaller geographic areas. Given the wide variation in the content and applications of the DHC, the Bureau faces a dilemma: it is unable to create a useful ‘target’ that will determine the parameters for a DAS.

3.6. Differential Privacy and the 2020 Census: Theory vs. Practice

In contrast to the previous disclosure avoidance approaches used by the Census Bureau, the differential privacy criterion provides a rigorous definition of privacy, computational algorithms for producing privatized data that satisfy it, and a body of research, developed by computer scientists, that establishes coherent theoretical foundations for the approach and a provable guarantee of privacy against a range of privacy attacks, that is, against strategies for determining the identities of individuals and their characteristics.25

Given the theoretical foundations of the DP criterion and the algorithms that implement it, it is important to understand that particular implementations ‘in practice’ may not always satisfy the conditions that justify these foundations. This is especially true in the case of implementing the DAS for the 2020 Census. In this section, we outline some of the dimensions of this implementation where theory and practice may diverge.

Divergence Between the Differential Privacy Criterion and Title 13.

We begin with a foundational issue concerning the relationship between the DP criterion and the legal obligations that the Census Bureau has to protect the privacy and confidentiality of individuals’ information when releasing data. In particular, data releases by the U.S. Census Bureau are governed by Title 13 of the U.S. Code. Title 13 provides individuals and businesses with a range of protections related to the information that the Bureau collects and publishes. In particular, Title 13: (a) prohibits the release of data to the public that allow any individual or business to be identified; (b) restricts the uses of the data the Bureau collects to statistical purposes only, that is, collected information cannot be used against respondents by any governmental agency or court; and (c) prohibits all Census Bureau employees or its designated agencies from disclosing private information, under penalty of law.

Provision (a) is the most relevant for purposes of designing a DAS for the public release of census data. There appear to be two aspects associated with the implementation of this provision: (i) whether it allows for any chance, or probability, of disclosure risk when releasing data and (ii) how to characterize that risk. As the theoretical literature on differential privacy noted above makes clear, creating data with zero risk of disclosure is impossible, especially given the presence of external information that may be used by intruders seeking to identify an individual, and attempting to eliminate all such risk would render the released data very inaccurate from a usability point of view. Thus, the practical objective of Title 13 and other privacy protection guarantees focuses on minimizing, or bounding, the probability of disclosing information that can be used to identify individuals.

The second issue concerns how to characterize the risk of disclosing any individual’s identity with the data that the Census Bureau releases on individuals that Title 13 seeks to protect. In a series of papers, George Duncan and Diane Lambert developed a framework for measuring the absolute risk of disclosure as the probability that an ‘intruder’ can identify individuals in a data set, based on the released data, what intruders know about individuals in the absence of such data, and what they know about the agency’s disclosure limitation strategy (Duncan & Lambert, 1986, 1989; Lambert, 1993).

More formally, let t be the data in the intruder’s possession (e.g., possibly including names and addresses as well as other variables). Then the absolute risk of disclosing the identity of individual j and her information from the release of the statistic or data set M(X) is defined as Pr(J = j, X_j | t, M(X)), that is, the probability that the intruder matches individual J to the jth person in the released data set and determines the value of their confidential data, X_j, conditional on the intruder’s data set (t) and the released data or statistics, M(X), for the confidential data set X. Title 13 seems to imply that the goal of an agency’s DAS is to keep this disclosure probability low by releasing data, M(X), that make it hard for the intruder to combine them with their own data set, t, to learn the identity of a given person. Duncan and Lambert and other statisticians noted that, by Bayes’ theorem, this disclosure probability is a posterior probability equal to:

$$\Pr\left( J = j, X_j \mid t, M(X) \in B \right) = \left\lbrack \frac{\Pr\left( M(X) \in B \mid t, J = j, X_j \right)}{\Pr\left( M(X) \in B \mid t \right)} \right\rbrack \Pr\left( J = j, X_j \mid t \right) \qquad (3)$$


where the expression in brackets is the ratio of the probability of the released statistic, $M(X)$, conditional on knowledge of individual $j$ being in the data ($J = j$) and the intruder’s information, $t$, relative to the probability conditional on the intruder’s information alone, and $\Pr(J = j, X_j \mid t)$ is the prior probability, or risk, that the intruder can identify a person of interest and the value of their data using only the intruder’s data, $t$. Statistical agencies cannot alter the prior risk of disclosure, $\Pr(J = j, X_j \mid t)$, but they can alter the term in brackets by releasing statistics $M(X)$ that attempt to reduce the absolute risk of disclosure, $\Pr(J = j, X_j \mid t, M(X) \in B)$, for each individual $j$ in the data. The goal of agencies’ disclosure avoidance systems is to use mechanisms to produce $M(X)$ that reduce this term in (3).

The above expression highlights the important fact that the risk of identifying individuals is always present, even if an agency releases no data at all, since the intruder may have information, $t$, that can be used to increase the chances of identification. Statistician Jerry Reiter and his coauthors (McClure & Reiter, 2012, 2016; Reiter, 2005) have extensively analyzed the measure of disclosure risk in (3) and evaluated how alternative approaches to disclosure avoidance, including those based on differential privacy, reduce this risk as a function of prior risk.
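To make the decomposition in (3) concrete, the following minimal sketch in Python computes the posterior disclosure risk as the product of the bracketed likelihood ratio and the prior risk. The function name and all numerical values are our own illustrative choices, not drawn from Duncan and Lambert or from any actual attack.

```python
# Illustrative sketch of the Duncan-Lambert decomposition in Equation (3):
# posterior risk = (likelihood ratio of the release) x (prior risk).
# All numbers are made up for illustration.

def posterior_disclosure_risk(prior_risk: float,
                              pr_release_given_match: float,
                              pr_release: float) -> float:
    """Posterior Pr(J = j, X_j | t, M(X) in B) via Bayes' theorem.

    prior_risk             : Pr(J = j, X_j | t), risk from the intruder's data t alone
    pr_release_given_match : Pr(M(X) in B | t, J = j, X_j)
    pr_release             : Pr(M(X) in B | t)
    """
    likelihood_ratio = pr_release_given_match / pr_release
    return likelihood_ratio * prior_risk

# Example: the released statistic is twice as likely under the hypothesis that
# person j, with data X_j, is in the file; the intruder's risk therefore doubles.
print(posterior_disclosure_risk(prior_risk=0.001,
                                pr_release_given_match=0.10,
                                pr_release=0.05))  # -> 0.002
```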

As noted above, the DP criterion places a bound on a form of disclosure risk, namely, on the probability of being able to detect whether or not any individual is included in the data. One of the benefits of focusing on this criterion for disclosure risk is that it does not require data stewards, such as the Census Bureau, to assess or determine the disclosure risks associated with any potential external information that might be used in a reidentification attack with released data. Put differently, the theoretical foundations of the DP approach for placing quantifiable bounds on this form of disclosure risk are not premised on particular assumptions about what intruders know or don’t know about individuals and their private data, or what they might learn about individuals in the future.

An important issue is whether, and what, bounds a DAS that meets the DP criterion places on the absolute disclosure risk considered by Duncan and Lambert, which characterizes the privacy protections under Title 13. In a recent paper, Gong and Meng (2020) establish the link between the DP criterion and the absolute disclosure risk of released statistics that meet the criterion. In particular, Gong and Meng (2020) show that the DP criterion defined in (1) implies that the probability of disclosure risk from releasing the statistic, $M_{\varepsilon}(X)$, produced by an ε-differentially private mechanism defined in (2), has the following bounds:

$$\Pr\left( J = j, X_j \mid t \right) \cdot e^{-\varepsilon} \;\leq\; \Pr\left( J = j, X_j \mid t, M_{\varepsilon}(X) \in B \right) \;\leq\; e^{\varepsilon} \cdot \Pr\left( J = j, X_j \mid t \right) \qquad (4)$$

or equivalently:

$$e^{-\varepsilon} \;\leq\; \frac{\Pr\left( J = j, X_j \mid t, M_{\varepsilon}(X) \in B \right)}{\Pr\left( J = j, X_j \mid t \right)} \;\leq\; e^{\varepsilon}. \qquad (4')$$

The representation in (4) states that the bound, especially the upper bound, on the probability of the absolute disclosure risk is a function of the ε-differential privacy bound, that is, the factor $e^{\varepsilon}$ given in (2), but that it also depends on $\Pr(J = j, X_j \mid t)$, the disclosure risk in the absence of the released data. Put differently, the representation in (4') states that ε-differential privacy places a bound on the incremental disclosure risk of the release of the statistic $M_{\varepsilon}(X)$ over and above the prior disclosure risk, $\Pr(J = j, X_j \mid t)$.26 As Gong and Meng note, these representations provide “a direct link between the differential privacy promise and the actual posterior risk of disclosure due to the release of the random query $M_{\varepsilon}$” (Gong & Meng, 2020, p. 61) and that “differential privacy is … about controlling the additional disclosure risk from releasing the privatized data to the users (or hackers), relative to what they know before the release” (p. 62).
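To see, in outline, why the DP criterion delivers the bounds in (4) and (4'), note that the denominator of the bracketed term in (3), $\Pr(M_{\varepsilon}(X) \in B \mid t)$, is an average of probabilities of the same form as the numerator, taken over the intruder’s remaining uncertainty about the confidential data. For example, for an intruder who already knows every record except individual $j$’s, that average runs over the possible values of this one record, and any two of the corresponding confidential data sets differ in a single record. The ε-DP criterion in (1) then keeps each term in the average, and hence the average itself, within a factor of $e^{\pm\varepsilon}$ of the numerator:

$$e^{-\varepsilon} \;\leq\; \frac{\Pr\left( M_{\varepsilon}(X) \in B \mid t, J = j, X_j \right)}{\Pr\left( M_{\varepsilon}(X) \in B \mid t \right)} \;\leq\; e^{\varepsilon}.$$

Multiplying this bounded ratio by the prior risk, as in (3), gives (4) and (4'). This is only our heuristic outline; Gong and Meng (2020) give the formal statement and the conditions under which it holds.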

Because this prior disclosure risk affects the bounds that reliance on meeting the DP criterion imposes on $\Pr(J = j, X_j \mid t, M_{\varepsilon}(X) \in B)$, higher values of $\Pr(J = j, X_j \mid t)$ necessarily imply less stringent upper bounds on $\Pr(J = j, X_j \mid t, M_{\varepsilon}(X) \in B)$. To illustrate, suppose that the prior risk of disclosure, $\Pr(J = j, X_j \mid t)$, were as large as 0.5. Then the release of the statistic, $M_{\varepsilon}(X)$, using a fairly ‘protective’ privacy-loss budget of ε = 0.7 or greater, would effectively place no bound on the absolute risk of disclosure defined in (3), since $e^{\varepsilon} \cdot \Pr(J = j, X_j \mid t) \approx 1.0$. Even for a more reasonable level of prior risk—say, for example, $\Pr(J = j, X_j \mid t) = 0.001$—the absolute risk of disclosure from the release of $M_{\varepsilon}(X)$ with ε = 0.7 is bounded by approximately 0.002, so that the absolute disclosure risk could be twice as high as with no release at all.
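The illustrative numbers above can be checked directly. The short Python sketch below, using prior risks we have invented for illustration, evaluates the upper bound $e^{\varepsilon} \cdot \Pr(J = j, X_j \mid t)$ from (4) at ε = 0.7.

```python
import math

# Upper bound on the absolute disclosure risk implied by Equation (4):
# posterior risk <= exp(epsilon) * prior risk, capped at 1 since it is a probability.
def posterior_risk_upper_bound(prior_risk: float, epsilon: float) -> float:
    return min(1.0, math.exp(epsilon) * prior_risk)

epsilon = 0.7  # the privacy-loss budget used in the illustration above
for prior_risk in (0.5, 0.001):  # illustrative prior risks Pr(J = j, X_j | t)
    print(prior_risk, "->", round(posterior_risk_upper_bound(prior_risk, epsilon), 4))
# 0.5   -> 1.0     (the bound is effectively vacuous)
# 0.001 -> 0.002   (roughly twice the risk with no release at all)
```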

The result from Gong and Meng raises a rather fundamental issue concerning the adequacy of a DAS designed to satisfy the DP criterion for meeting the Title 13 privacy protections. As noted earlier, Title 13 prohibits the release of data to the public that allows any individual or business to be identified. Allowing that any data release entails some risk of disclosing an individual’s identity, the Census Bureau remains obligated by this title to control the probability of this risk, that is, to control $\Pr(J = j, X_j \mid t, M_{\varepsilon}(X) \in B)$. But, as noted by Duncan and Lambert, this risk depends not only on the safeguards that the Census Bureau’s DAS imposes to control it, but also on the risk of disclosure that may exist in the absence of any released data, due to externally available information on individuals, $t$, that can be combined with released data to identify individuals who are in the census.

Does Meeting the Differential Privacy Criterion Necessarily Limit Absolute Disclosure Risk? Potential Concerns.

In one sense, the result from Gong and Meng and the illustrative calculations in the previous section may seem somewhat abstract and hypothetical. Even though the DP criterion does not place a bound on absolute disclosure risk, it may nonetheless limit this form of risk in practice. However, the recent evidence presented in Kenny et al. (2021) suggests that this may not be the case.

The authors of this study examined a version of a reidentification attack with the 2010 Census demonstration data that was produced using the version of the TDA used to produce the PL 94-171 data from the 2020 Census that was released in August 2021 (this version of the TDA used an overall ε of 19.6). These authors built a model that utilized geographic information from the 2010 demonstration data together with the names of voters from voter registration data for a selected set of states, to predict individuals’ race and ethnicity in those states. Because the voting registration data for these states contained voters’ race and ethnicity, they were able to assess how well their model predicted individuals’ race/ethnicity.

These authors were able to accurately predict the race/ethnicity of approximately 90% of voters in these states. Furthermore, when they did not include the geographic information from the 2010 demonstration data, but only used voters’ names, the accuracy of their predictions fell to slightly less than 80% (as noted in Imai & Khanna, 2016). To be clear, this finding does not imply a violation of the DP criterion, since, as noted above, that criterion places a bound on the incremental risk of disclosure from the release of data like the 2010 Census. However, this evidence suggests that a ‘reidentification attack’ that combines publicly available voter registration data with data produced by the DP-based DAS used by the Census Bureau can be fairly successful, which is precisely the sort of attack this DAS was designed to limit. To be clear, there are other explanations for this finding in the Kenny et al. (2021) study. Nevertheless, it does raise serious concerns about whether a DP-based DAS adequately protects the privacy of individuals’ identities in the census, as Title 13 mandates.
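Although the model in Kenny et al. (2021) is more elaborate, the basic logic of combining voters’ surnames with small-area census counts is a Bayes-rule calculation of the kind described by Imai and Khanna (2016). The Python sketch below is our own simplified illustration with invented probabilities; it is not the authors’ code, data, or exact method.

```python
# Simplified illustration of a surname-plus-geography race/ethnicity prediction.
# All probabilities are invented; a real analysis would use Census surname lists
# and released block-level tabulations.

RACES = ("white", "black", "hispanic", "asian", "other")

surname_prior = {          # Pr(race | surname), e.g., from a surname list
    "garcia": (0.05, 0.01, 0.90, 0.01, 0.03),
}
block_composition = {      # Pr(race | block), from released census tabulations
    "block_A": (0.70, 0.10, 0.10, 0.05, 0.05),
}
national_share = (0.60, 0.13, 0.18, 0.06, 0.03)   # Pr(race) overall

def predict_race(surname: str, block: str) -> dict:
    """Posterior Pr(race | surname, block), assuming surname and block are
    conditionally independent given race (the usual simplifying assumption)."""
    prior = surname_prior[surname]
    geo = block_composition[block]
    # Pr(race | surname, block) is proportional to
    # Pr(race | surname) * Pr(race | block) / Pr(race).
    unnormalized = [p * g / n for p, g, n in zip(prior, geo, national_share)]
    total = sum(unnormalized)
    return {race: u / total for race, u in zip(RACES, unnormalized)}

print(predict_race("garcia", "block_A"))
# More accurate small-area counts sharpen these predictions, which is why
# block-level census releases can aid this kind of attack.
```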

A second concern with the Bureau’s TopDown Algorithm is related to its postprocessing features. The standard implementation of DP for count data would be to infuse randomly generated noise into the various counts to be released to the public, where the noise infusion is calibrated to a particular privacy budget. The resulting counts, or what are referred to as ‘noisy measurements,’ have the property that they conform to the DP criterion for the particular PLB determined by a data curator. But such noisy measurements have some undesirable properties and do not satisfy certain consistency requirements relevant for data from a census. In particular, the noisy measurements need not be integer valued and, more importantly, they need not be nonnegative. Furthermore, the noisy measurements need not produce tabulations in which the counts for smaller geographic units, for example, states, counties, or tracts, add up to the counts for the more aggregated units that contain them: state populations adding up to the population count for the nation, the populations of a state’s counties adding up to the state population, and so on. These aggregation properties are essential for the U.S. Census, since, for example, population counts for states must equal the actual enumerated counts, and counts for substate geographic units must add up to these actual counts; holding such counts fixed at their enumerated values is known as imposing invariants. Imposing these invariants, together with the requirements that counts be nonnegative and integer valued, requires additional constraints to be imposed in the production of publicly released data products. The imposition of these constraints constitutes the postprocessing steps used by the Census Bureau in its TopDown Algorithm.
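To illustrate these points in a deliberately simplified way, the Python sketch below adds Laplace noise to a vector of block counts and then postprocesses the result into nonnegative integers that sum to the confidential total. It is our own toy construction, not the Bureau’s TopDown Algorithm, but it shows both why postprocessing is needed and why, as discussed below, the postprocessing depends on the confidential data through the invariant total.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_then_postprocessed(true_counts, epsilon):
    """Toy illustration, not the Census Bureau's TopDown Algorithm.

    Step 1: add Laplace noise (scale 1/epsilon, sensitivity 1 per count) to get
            'noisy measurements' that may be negative, non-integer, and
            inconsistent with the true total.
    Step 2: postprocess into nonnegative integers that sum to the confidential
            total, treated here as an invariant. Note that Step 2 uses the
            confidential total, not just the noisy measurements.
    """
    true_counts = np.asarray(true_counts, dtype=float)
    noisy = true_counts + rng.laplace(scale=1.0 / epsilon, size=true_counts.size)

    invariant_total = int(true_counts.sum())        # confidential count held exact
    clipped = np.clip(noisy, 0, None)
    if clipped.sum() == 0:
        clipped = np.ones_like(clipped)             # avoid dividing by zero
    scaled = clipped * invariant_total / clipped.sum()
    released = np.floor(scaled).astype(int)
    # Hand out the units lost to rounding down, largest fractional parts first.
    shortfall = invariant_total - released.sum()
    order = np.argsort(scaled - released)[::-1]
    released[order[:shortfall]] += 1
    return noisy, released

noisy, released = noisy_then_postprocessed([12, 0, 3, 55], epsilon=0.5)
print(noisy)     # real-valued, possibly negative
print(released)  # nonnegative integers summing to 70, the confidential total
```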

In their seminal paper, Dwork and Roth (2014) characterize the theoretical foundations of the use of algorithms that impose the DP criterion. One of the key properties that they prove (see Proposition 2.1 on p. 229) is that “differential privacy [and its privacy guarantees] is immune to post-processing” (Dwork & Roth, 2014, p. 228).

However, as the primary focus of their study, Gong and Meng (2020) note that the Dwork-Roth proposition on postprocessing relies on a particularly important assumption, namely, that any postprocessing applied to data produced with DP-based algorithms is a function only of those privatized data and not of the underlying confidential data. But, as Gong and Meng note, the postprocessing constraints imposed in the TDA are functions of the confidential data, such as the invariant state population counts. As a result, the Dwork-Roth proposition of immunity of DP privacy guarantees to postprocessing does not actually apply to the sort of postprocessing that the Census Bureau imposed in its TDA to produce the 2010 Census demonstration products and the PL 94-171 2020 Census data. Thus, even if one were to accept the DP criterion as bounding an appropriate form of disclosure risk, the properties ascribed to the DP criterion may not hold in practice for its implementation in the Census Bureau’s DAS.

Finally, we note that the successes the Census Bureau achieved with producing fairly accurate data in its final demonstration file for use in redistricting may not carry over to its development of the DHC files from the 2020 Census that are currently scheduled for release in May of 2023. Recall that the improvements in data accuracy for the demonstration versions of the PL 94-171 data were achieved by ‘tuning’ the allocation of the overall privacy-loss budget, especially for low-density geographic areas.

4. Moving Forward: The Bureau’s Approach to Privacy Protection and Title 13

In this final section, we ask: What is an ‘optimal outcome’? Is it achievable in the context of 2020 Census products, both the redistricting data already released and the planned release of Demographic and Housing Characteristics data? How does it relate to the privacy protections afforded by Title 13?

In considering the current and future challenges to privacy noted above and the use of differential privacy to meet them, the discussion suggests to us that the requirements of Title 13 pose an unsettling dilemma. Title 13 is concerned with the absolute risk to privacy, in that the identity of any individual respondent cannot be revealed. This requirement, put in place more than 65 years ago, prohibits the dissemination of individual responses by federal agencies for actions against individuals, thus assuring respondents that their data will remain confidential. Threats of disclosure from external sources as a means of identifying individuals were not nearly the consideration then that they are today.

From the earlier analysis, it is apparent that a DAS driven by DP does provide a guarantee of privacy against all attackers as a theoretical construct, but it does so in a relative way. That is, the results from Gong and Meng (2020) and Kenny et al. (2021), discussed earlier, raise serious concerns about the extent to which DP can minimize the potential disclosure risks that Title 13 covers. In particular, fully assessing what bounds the ε-differentially private mechanisms used in the Bureau’s TopDown Algorithm, and the statistics $M_{\varepsilon}(X)$ they produce from the confidential data $X$, place on the absolute disclosure risk that is the focus of Title 13 requires knowledge of the disclosure risks associated with external information. What this means is that the practical assessment of privacy risks requires knowledge of the disclosure risk from external information, $t$, that is, of $\Pr(J = j, X_j \mid t)$, and is, thus, relative, not absolute. The challenges that the Census Bureau faces today and in the future from the availability of more and more disclosive external data truly make it more difficult to meet Title 13 privacy protections—a big dilemma regardless of what system is used.27

Yet, the absolute nature of Title 13 remains, in the midst of what is today a maelstrom of data from a seemingly endless list of sources and a future with untold possibilities. And, without taking a stand on the likelihood of disclosure threats that are due to external sources of data, the Census Bureau cannot quantify the implied bounds on the absolute risks of disclosure of its DP-based DAS for the 2020 Census data releases that are covered by Title 13. Moreover, neither its DP-based DAS, nor any other DAS for that matter, can meet a literal reading of Title 13 that would require reducing the probability of disclosure to zero.

As a result, the privacy protections accorded to data releases from the 2020 Census still are subject to a normative interpretation of Title 13, derived by members of the DSEP committee, a group of high-level career professionals at the Census Bureau. Moreover, it is hard to understand how DSEP can make informed determinations of these risks, given that there is insufficient empirical evaluation of the privacy loss associated with incremental changes in the privacy-loss budget. In the case of the release of the PL 94-171 data in August 2021, the ‘solution’ of raising the privacy-loss budget for this release did result in quantifiable improvement in data utility. However, its release has not been accompanied, to date, by an estimate of the privacy loss, which according to the earlier analysis (i.e., Equations 4 and 4') could be substantial. And, since the Census Bureau needs to work within a fixed PLB moving forward, it is not clear what the level of the PLB will need to be in order to provide any ‘optimal’ balance between accuracy and privacy as it rolls out future products.

Finally, how can the Bureau educate data users and other stakeholders about the value of DP for maintaining privacy when it has not been able to quantify the marginal gains associated with the incremental increases in the privacy loss budget? Moreover, these stakeholders will be at a loss when attempting to explain the methods used by the Bureau in an effort to protect privacy. This flies in the face of the very transparency that the Census Bureau has argued since day one is provided by a DP-based DAS. To be fair, given the current interpretation of Title 13, this dilemma is likely to be faced by the Bureau going forward, irrespective of the methods used for protecting confidentiality of the data and the privacy of respondents.

Given these considerations, we strongly recommend that the Census Bureau immediately begin conducting a series of reconstruction and reidentification attacks on data products that it has released or plans to release. Furthermore, we urge the Census Bureau to release these results to the public so as to provide important context for a reexamination of Title 13 with Congress. The fact that Title 13 has not been revised in over 45 years suggests that actually revising it is not an easy task or something that is likely to happen soon. Nonetheless, we see a serious reexamination of it as needed and long overdue.28

However likely this reexamination is to occur, we offer the following suggestions for options that should be considered if and when Title 13 is revised. An increased focus should be placed on:

(a.) Efforts to penalize the misuse of data released by the Census Bureau. This may entail making it a criminal act to misuse data released by the Census Bureau or other agencies, where misuse of the data would need to be defined by statutes, that is, defining ‘harmful’ uses of data, such as using it to locate ex-spouses or partners for the purpose of harming them or agencies/individuals using it to locate undocumented immigrants for the purpose of deportation.

(b.) Efforts to permit individuals who have been harmed by a misuse of released data to take civil action against the perpetrators of these harmful acts. A possible model for the provisions in (a) and (b) may be crafted along the lines of laws that are used to protect consumers from identity theft and the like.

(c.) Work to revise Title 13 to require the Census Bureau to be more transparent with individuals that a zero-risk of disclosure is impossible to achieve but that safeguards are in place to assure that the risk remains very low.

(d.) Funding to facilitate experiments and assessments on the willingness of respondents to surveys to give up some privacy, especially the loss of privacy for certain types of information about them. That is, people may not be as concerned about the potential disclosure of their race or age but may be very concerned about disclosing their income or their health status. Such information could help to target what information is more important to protect and what is less important.

(e.) More resources for the Census Bureau to partner with and educate the data user community on how best to communicate the responsible use of their privatized data and interpret findings from it. This should lead to standards about how to account for forms of nonsampling as well as sampling errors in released data.


Acknowledgments

We wish to acknowledge the useful conversations we had with Margo Anderson and Claire Bowen in the preparation of this paper. We thank the editor and three anonymous reviewers for their comments on earlier drafts of this article. We also wish to thank Ruobin Gong and Connie Citro for comments provided on an earlier draft of this paper. Finally, Hotz wishes to acknowledge his collaboration with Chris Bollinger, Tatiana Komorova, Charles Manski, Robert Moffitt, Denis Nekipelov, Aaron Sojourner, and Bruce Spencer on data privacy and usability within the U.S. federal statistical agencies and some of the ideas and points made in this paper. All errors and views expressed in this paper are the sole responsibility of the authors.

Disclosure Statement

V. Joseph Hotz and Joseph Salvo have no financial or non-financial disclosures to share for this article.


References

Asian Americans Advancing Justice-Mexican American Legal Defense and Educational Fund. (2020, April 25). Asian Americans Advancing Justice - AAJC and MALDEF recommendations to the data metrics overview.

Abowd, J. (2016, September 15–16). The challenge of scientific reproducibility and privacy protection for statistical agencies [Paper presentation]. Census Scientific Advisory Committee meeting, Suitland, MD. https://www2.census.gov/cac/sac/meetings/2016-09/2016-abowd.pdf

Abowd, J. M. (2021). Declaration of John Abowd, State of Alabama v. U.S. Department of Commerce. Case No. 3:21-CV-211-RAH-ECM-KCN, Document 41-1 Filed 04/13/21. https://www2.census.gov/about/policies/foia/records/alabama-vs-doc/alabama-ii-41-defs-pi-opposition-and-declarations.pdf?utm_campaign=20210419msdecs1ccdtar&utm_medium=email&utm_source=govdelivery

Abowd, J. M., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., & Zhuravlev, P. (2022). The 2020 Census disclosure avoidance system TopDown Algorithm. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.529e3cb9

Anderson, M. (2020). The census and the Japanese “internment”: Apology and policy in statistical practice. Social Research: An International Quarterly, 87(4), 789–812. https://doi.org/10.1353/sor.2020.0064

Anderson, M., & Seltzer, W. (2007). Challenges to the confidentiality of U.S. federal statistics, 1910–1965. Journal of Official Statistics, 23(1), 1–34. http://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/challenges-to-the-confidentiality-of-u.s.-federal-statistics-1910-1965.pdf

Association of Public Data Users. (2020, May 22). Letter to Director Steven D. Dillingham from APDU Executive Director Kenneth E. Poole, Arlington, VA.

boyd, d. (2020 May). Balancing data utility and confidentiality in the 2020 US Census. Data & Society, A Living Document. https://datasociety.net/library/balancing-data-utility-and-confidentiality-in-the-2020-us-census/

Cohen, A., Duchin, M., Matthews, J. N., & Suwal, B. (2021). Census TopDown: The impacts of differential privacy on redistricting [Paper presentation]. 2nd Symposium on Foundations of Responsible Computing (FORC 2021), June 9–11, online. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

Czajka, J. L., & Beyler, A. (2016). Declining response rates in federal surveys: Trends and implications. Mathematica Policy Research. https://www.mathematica.org/download-media?MediaItemId={EE7C8B55-B1F4-45E1-82FC-276AA4E4700A}.

Devine, J., Borman, C., & Spence, M. (2020). 2020 Census disclosure avoidance improvement metrics [Presentation]. Committee on National Statistics - National Academies Disclosure Avoidance Working Group, Washington, DC, March 18.

Duncan, G. T., & Lambert, D. (1989). The risk of disclosure for microdata. Journal of Business & Economic Statistics, 7(2), 207–217. https://doi.org/10.1080/07350015.1989.10509729

Duncan, G. T., & Lambert, D. (1986). Disclosure-limited data dissemination. Journal of the American Statistical Association, 81(393), 10–18. https://doi.org/10.1080/01621459.1986.10478229

Dwork, C., McSherry F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In S. Halevi & T. Rabin (Eds.), Theory of cryptography (pp. 265–284). Springer-Verlag. https://doi.org/10.1007/11681878_14

Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407. https://doi.org/10.1561/0400000042

El-Badry, S., & Swanson, D. A. (2007). Providing census tabulations to government security agencies in the United States: The case of Arab Americans. Government Information Quarterly, 24(2), 470–487. https://doi.org/10.1016/j.giq.2007.02.001

Federal-State Cooperative Program for Population Estimates. (2020, April 27). Letter to Census Bureau Director Steven Dillingham from the FSCPE Steering Committee.

Garfinkel, S. (2017, September 14–15). Modernizing disclosure avoidance: Report on the 2020 Disclosure avoidance subsystem as implemented for the 2018 end-to-end test (Continued) [Presentation]. Census Scientific Advisory Committee, U.S. Census Bureau, Center for Disclosure Avoidance Research, Suitland, MD.

Gates, G. W. (2012). Confidentiality. In M. Anderson, C. Citro, & J. Salvo (Eds.), Encyclopedia of the U.S. Census (2nd Ed.; pp. 94–95). CQ Press.

Gong, R., & Meng, X-L. (2020). Congenial differential privacy under mandated disclosure. In J. Wing & D. Madigan (Eds.), Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference (pp. 59–70). Association for Computing Machinery. https://doi.org/10.1145/3412815.3416892

Groshen, E. L., & Goroff, D. L. (2022). Disclosure avoidance and the 2020 Census: What do researchers need to know? Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.aed7f34f

Hauer, M. E., & Santos-Lozada, A. R. (2021). Differential privacy in the 2020 Census will distort Covid-19 rates. Socius, 7, Article 2378023121994014. https://doi.org/10.1177/2378023121994014

Hotz, V. J., & Salvo, J. (2020). Assessing the use of differential privacy for the 2020 Census: Summary of what we learned from the CNSTAT workshop. https://www.amstat.org/asa/files/pdfs/POL-CNSTAT_CensusDP_WorkshopLessonsLearnedSummary.pdf

Imai, K., & Khanna, K. (2016). Improving ecological inference by predicting individual ethnicity from voter registration records. Political Analysis, 24(2), 263–272. https://doi.org/10.1093/pan/mpw001

JASON. (2022). Consistency of data products and formal privacy methods for the 2020 Census. JSR-21-02, The MITRE Corporation. https://www2.census.gov/programs-surveys/decennial/2020/program-management/planning-docs/2020-census-data-products-privacy-methods.pdf

Kenny, C. T., Kuriwaki, S., McCartan, C., Rosenman, E. T. R., Simko, T., & Imai, K. (2021). The use of differential privacy for census data and its impact on redistricting: The case of the 2020 US Census. Science Advances, 7(41), Article eabk3283. https://doi.org/10.1126/sciadv.abk3283

Krieger, N., Nethery, R. C., Chen, J. T., Waterman, P. D., Wright, E., Rushovich, T., & Coull, B. A. (2021). Impact of differential privacy and census tract data source (Decennial Census versus American Community Survey) for monitoring health inequities. American Journal of Public Health, 111(2), 265–268. https://doi.org/10.2105/ajph.2020.305989

Lambert, D. (1993). Measures of disclosure risk and harm. Journal of Official Statistics, 9, 313–331. http://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/measures-of-disclosure-risk-and-harm.pdf

McClure, D., & Reiter, J. P. (2012). Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Transactions on Data Privacy, 5(3), 535–552. http://www.tdp.cat/issues11/tdp.a093a11.pdf

McClure, D., & Reiter, J. P. (2016). Assessing disclosure risks for synthetic data with arbitrary intruder knowledge. Statistical Journal of the IAOS, 32(1), 109–126. https://doi.org/10.3233/SJI-160957

McGeeney, K., Kriz, B., Mullenax, S., Kail, L., Walejko, G., Vines, M., Bates, N., & García Trejo, Y. (2019). 2020 Census Barriers, Attitudes, and Motivators Study (CBAMS) survey report, Version 2.0. U.S. Census Bureau. https://www2.census.gov/programs-surveys/decennial/2020/program-management/final-analysis-reports/2020-report-cbams-study-survey.pdf.

McKenna, L. (2018). Disclosure avoidance techniques used for the 1970 through 2010 Decennial Censuses of Population and Housing. U.S. Census Bureau Research and Methodology Directorate. https://www2.census.gov/ces/wp/2018/CES-WP-18-47.pdf

Minnis, T. A., Dutta-Gupta, I., Saenz, T. A., & Ross, D. (2020, May 15). Letter to John Abowd and Victoria Velkoff on behalf of the AAJC, the Georgetown GCPI, MALDEF, and NCoC.

National Academies of Sciences, Engineering, and Medicine. (2020). 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. The National Academies Press. https://doi.org/10.17226/25978

National Academies of Sciences, Engineering, and Medicine. (2021). Principles and Practices for a Federal Statistical Agency (7th ed.). The National Academies Press. https://doi.org/10.17226/25885

National Conference of State Legislators. (2020, May 14). Letter to The Honorable Carolyn Maloney and The Honorable Jim Jordan from Tim Storey, Executive Director, Washington, D.C.

Native Hawaiian-Serving Organizations. (2020, April 28). Letter to Dr. Steven Dillingham RE: Recommendations on the Disclosure Avoidance System.

National Research Council. (2013). Nonresponse in social science surveys: A research agenda. The National Academies Press. https://doi.org/10.17226/18293

National Congress of American Indians. (2020, April 23). Letter to Steven D. Dillingham from Kevin Allis, Chief Executive Office.

Office of Management and Budget. (1994). Report on statistical disclosure limitation methodology. Statistical Policy Working Paper 22. Office of Information and Regulatory Affairs.

Population Association of America. (2020, May 14). Letter to Director Steven D. Dillingham from Eileen Crimmins (PAA President) and Kathleen A. Cagney (APC President), Alexandria, VA.

Reamer, A. (2019). Brief 7: Comprehensive Accounting of Census-Guided Federal Spending (FY2017). George Washington University Institute of Public Policy.

Reiter, J. P. (2005). Estimating risks of identification disclosure in microdata. Journal of the American Statistical Association, 100(472), 1103–1112. https://doi.org/10.1198/016214505000000619

Salvo, J. J. (2012). Census tracts. In M. Anderson, C. Citro, & J. Salvo (Eds.), Encyclopedia of the U.S. Census (2nd ed.; pp. 82–84). CQ Press.

Santos-Lozada, A. R., Howard, J. T., & Verdery, A. M. (2020). How differential privacy will affect our understanding of health disparities in the United States. Proceedings of the National Academy of Sciences, 117(24), 13405–13412. https://doi.org/10.1073/pnas.2003714117

Second War Powers Act (1942). Section 1402, Executive Order 9157, 77th Congress, 2d Session (S2208), Approved March 27.

Seltzer, W., & Anderson, M. (2007, March 29–31). Census confidentiality under the Second War Powers Act (1942–1947) [Paper presentation]. Population Association of America Annual Meeting, New York.

Soliciting Feedback (2018). Federal Register, Vol 83, No. 139 (July 19) p. 34111.

Sullivan, C. (1992). An overview of disclosure principles. Research Report Series, No. RR-92/09. Statistical Research Division, U.S. Bureau of the Census.

U.S. Census Bureau. (2018). 2020 Census operational plan, Version 4.0. https://www2.census.gov/programs-surveys/decennial/2020/program-management/planning-docs/2020-oper-plan4.pdf

U.S. Census Bureau. (2003). Monograph: Census confidentiality and privacy: 1790 – 2002. https://www.census.gov/library/publications/2003/comm/monograph-confidentiality-privacy.html

U.S. Census Bureau. (1994). 1990 Census procedural history. https://www.census.gov/history/pdf/1990proceduralhistory.pdf

U.S. Department of Justice. (2021, September 1). Guidance under Section 2 of the Voting Rights Act, 52 U.S.C. 10301, for redistricting and methods of electing government bodies. https://www.justice.gov/opa/press-release/file/1429486/download

Vadhan, S. (2017). The complexity of differential privacy. In Y. Lindell (Ed.), Tutorials on the foundations of cryptography (pp. 347–450). Springer. https://doi.org/10.1007/978-3-319-57048-8_7

Wood, A., Altman, M., Bembenek, A., Bun, M., Gaboardi, M., Honaker, J., Nissim, K., Obrien, D. R., Steinke, T., & Vadhan, S. (2018). Differential privacy: A primer for a non-technical audience. Vanderbilt Journal of Entertainment & Technology Law, 21(1), 209–275. https://scholarship.law.vanderbilt.edu/jetlaw/vol21/iss1/4

Zeisset, P. (1978). Suppression vs. random rounding: Disclosure avoidance alternatives for the 1980 Census. https://www.census.gov/library/working-papers/1978/adrm/cdar1978-01.html


©2022 by V. Joseph Hotz and Joseph Salvo. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
