The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another key aspect of the TDA are invariants, statistics that the Census Bureau has determined, as matter of policy, to exclude from the privacy-loss accounting. The TDA postprocesses the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This article describes the mathematics and testing of the TDA for this purpose.
Keywords: differential privacy, 2020 Census, TopDown Algorithm, redistricting data
Differential privacy (Dwork, McSherry, et al., 2006) (henceforth DP) is considered the gold standard in privacy-protected data publication—it allows organizations to collect and publish statistics about groups of people while protecting the confidentiality of their individual responses. Initially adopted by the U.S. Census Bureau in 2008 for the OnTheMap product (Machanavajjhala et al., 2008; U.S. Census Bureau, 2008), it has since seen development by Google (Bittau et al., 2017; Erlingsson et al., 2014), Apple (Apple Differential Privacy Team, 2017), Uber (Johnson et al., 2018), and Microsoft (Ding et al., 2017). This article describes the implementation of the TopDown Algorithm, an algorithm developed within the differential privacy framework for the 2020 Census of Population and Housing in the United States (Abowd, 2018b).
The 2020 Census implementation will be among the largest deployments of differential privacy using the trusted-curator (centralized) model and, arguably, will have the highest stakes of any deployed formal privacy system, since decennial census data are used for apportionment, redistricting, allocation of funds, public policy, and research. The TopDown Algorithm (TDA) is the name given to the DP mechanisms, optimization algorithms, and associated postprocessing that are used for research and production. TDA generates confidentiality-preserving person- and housing-unit–level data called microdata detail files (MDF) with demographic and housing-unit information from the resident populations of the United States and Puerto Rico. We refer to the system that carries out the TDA as the 2020 Census Disclosure Avoidance System (DAS). The DAS is the collection of computer programs that implements statistical disclosure limitation for the census. It replaces the record-swapping system used from 1990 to 2010 for decennial censuses.
This article provides an overview of the DAS and TDA, discusses the motivation for adopting DP as the privacy-loss accounting framework as compared to other disclosure avoidance frameworks, and summarizes the critical policy considerations that motivated specific aspects of the design of TDA. We focus on the implementation of TDA that the Census Bureau used to release the 2020 Census Redistricting Data (P.L. 94-171) Summary File (redistricting data, hereafter) (Final Content Design for the Prototype 2020 Census Redistricting Data File, 2018). We discuss the utility of the TDA, but leave for future work a discussion of how users can best analyze the released data and how the privacy semantics—the interpretation of DP—are modified in the presence of invariants.
The remaining sections of the paper are organized as follows. Section 2 lays out the rationale for adopting DP as the disclosure avoidance framework for the 2020 Census. Section 3 discusses the overarching policy and practical constraints that governed the implementation. Section 4 discusses the many sources of uncertainty that are inherent in producing census population data. Section 5 provides the schema, invariants (tabulations that are not processed by the differentially private mechanisms), and constraints for the specification of the redistricting data, the first data product released using the methods described in this article. Section 6 provides an overview of the DP mechanism implemented in the TDA as well as the relevant mathematical properties. Section 7 goes into detail about estimation routines and improvements to the algorithm. Section 8 describes the tuning and testing of the DAS over time. Section 9 describes a set of experiments carried out to study the effect of certain design choices on the utility of the results using the 2010 redistricting data. Section 10 concludes.
The decennial census data are used to apportion the House of Representatives, to allocate at least $675 billion of federal funds every year,1 and to redistrict every legislative body in the nation.2 The accuracy of those data is extremely important. The Census Bureau is also tasked with protecting the confidentiality of the respondents and the data they provide. This dual mandate presents a challenging problem because data accuracy and data privacy are competing objectives (Abowd & Schmutte, 2019). For any given privacy-protection procedure, more accuracy means less privacy, and the theory underlying differential privacy and related research have helped to quantify that trade-off. We use the term formally private to encompass frameworks that include the many variants of differential privacy. Pure differential privacy, approximate differential privacy, and concentrated differential privacy are all examples of formally private methodologies. A disclosure avoidance framework is formally private if it is constructed from randomized mechanisms whose properties do not depend on the realized confidential data nor, ideally, on limitations on the attacker’s information set. A formally private framework must provide an accounting system that quantifies the privacy loss associated with the ensemble of published query answers. TDA uses the formal privacy framework defined by zero-Concentrated Differential Privacy, which will be defined below.
There are several reasons why the Census Bureau introduced formally private methodology for disclosure avoidance. First, the vulnerabilities of traditional disclosure limitation methods are well known among privacy researchers and legacy systems have been attacked in various ways (Barth-Jones, 2012; Cohen & Nissim, 2020; Dinur & Nissim, 2003; Dobra & Fienberg, 2000; Dwork et al., 2017; Fienberg & Slavkovic, 2005; Garfinkel, 2015; Hansell, 2006; Homer, 2008; Kifer, 2009; Narayanan & Shmatikov, 2008; Sweeney, 2002; Wong et al., 2007). Access to petabyte-scale cloud-based computing resources and software libraries designed to use these resources has increased enormously. At the same time, the amount of commercially available or independently held data on individuals that could be used as auxiliary information in order to reidentify individuals in Census Bureau products has also exploded. Unlike the Census Bureau, which has operated under a strict data confidentiality statute since 1954 (Title 13 - Census, 1954), internet-based data aggregators like Apple, Facebook, Google, LinkedIn, Microsoft, Uber, Twitter, and many others only recently became subject to tight privacy regulation in the form of the California Consumer Privacy Act (Consumer Privacy Act, 2018) and the European Union General Data Protection Regulation (General Data Protection Regulation, 2016). The presence of vast arrays of personal information held by private companies and state actors in combination with software like Gurobi, CPLEX and GLPK designed to solve many NP-hard systems of billions of simultaneous equations finally realized the scenario that national statistical agencies have known was a vulnerability since Ivan Fellegi’s original paper on data confidentiality (Fellegi, 1972).
Formal methods like differential privacy were developed specifically to avoid the vulnerabilities of traditional disclosure limitation methods. Furthermore, in contrast to the previous disclosure avoidance methods used for decennial censuses, the exact methodology and parameters of the randomized mechanisms that implement differential privacy are transparent, meaning an organization can release the source code and parameters of the mechanisms, but not the actual random numbers used, without compromising the privacy guarantees (Dwork, McSherry, et al., 2006).
In addition to transparency, two other key properties make DP methods attractive compared with traditional disclosure avoidance methods. These are the absence of degradation under postprocessing and adaptive composition. First, postprocessing (Dwork, McSherry, et al., 2006; Dwork & Roth, 2014; McSherry, 2009) means that if we run an algorithm
To the best of our knowledge, no traditional disclosure avoidance method satisfies transparency, nondegradation under postprocessing, and composition (Abowd & Schmutte, 2015). In particular, the household record-swapping method used in the 1990, 2000, and 2010 Censuses (McKenna, 2018) is not transparent and degrades poorly under composition. Its parameters and the details of the swapping algorithm cannot be published without compromising the privacy guarantee. Aggregation, which was combined with swapping in previous censuses, does not preserve uncertainty about the underlying data under postprocessing because it allows reidentification via reconstruction of the microdata (Dinur & Nissim, 2003; JASON Group, 2020). DP mechanisms also degrade under composition but in a predictable and smoothly bounded manner, thus avoiding highly accurate database reconstructions if agencies reasonably control their expended privacy-loss budget. DP mechanisms do not degrade under postprocessing. For the traditional disclosure limitation method of cell suppression (Cox, 1980; Cox et al., 1987), an arbitrarily large part of the suppression pattern can be unraveled if an attacker knows the value of a single cell that the system assumed was not known, which can occur either as prior knowledge or because some other group did a cell-suppressed release of overlapping data but didn’t suppress the same cells. Cell suppression has formal guarantees, but they only hold against an attacker who knows exactly what was published and nothing else relevant. Suppression breaks down in a brittle, discontinuous way when the attacker knows more. Such knowledge may not exist when the data are published, but may enter the public domain at a later point in time. In contrast, formal privacy provides guarantees against general rather than specific classes of attackers. While the protection of formally private systems also degrades with uncoordinated independent data releases, this degradation is smoothly bounded and fully quantified by the cumulative global privacy-loss budget.
An important aspect of formal privacy is the acknowledgment that the attacker’s knowledge plays a large role in disclosure avoidance and that assumptions about this knowledge are inherently imperfect. Thus, the definitions used in formal privacy systems attempt to limit the extent to which the protections depend upon the specifics of the attacker’s prior information. These limitations usually take the form of worst-case analysis and in that sense can be viewed as a minimax approach to a class of privacy protection problems (Wasserman & Zhou, 2010).
Rather than attempting to make statements about the absolute privacy risk, which requires specifying the attacker’s information set and strategy, DP shifts the focus to relative privacy guarantees, which hold for arbitrary attacker prior beliefs and strategies. For example, rather than saying Alice’s data are not reidentifiable given the output of the DP mechanism, we say that Alice’s incremental risk of reidentification is measured by the difference between the output of the DP mechanism that did not include Alice’s data as input versus the same one that did. An attacker or user should be able to make a statistical inference about Alice—one that depends on aggregate data—but not an identifying inference—one that is enabled only by the presence of Alice’s information as input to the DP mechanism. The mathematical analysis that quantifies these limitations on identifying inferences is called privacy semantics (Balle et al., 2020; Dong et al., 2022; Kairouz et al., 2015; Kasiviswanathan & Smith, 2014; Wasserman & Zhou, 2010).
There are multiple definitions of formal privacy, each having slightly different motivation and mathematical primitives. We use two specific formalizations as part of the TDA. The primary definition is that of approximate differential privacy or
The reason for specifying two privacy definitions is that, during the course of implementing the TDA, we observed substantial accuracy improvements from adopting a discrete Gaussian mechanism (Canonne et al., 2021) for the generation of noisy measurements versus the geometric mechanism (Ghosh et al., 2012). While we quantify our privacy semantics using approximate differential privacy, privacy-loss accounting using the discrete Gaussian mechanism more naturally aligns with zCDP—allocating
Rather than determining a single
Definition 1 describes a property of the output distribution of DP mechanism
The data curator is the organization that collects and retains the data to be protected, often in a formal database. The 2020 Census application of DP uses the trusted-curator model, sometimes called the centralized DP model, meaning that a legally authorized entity (the U.S. Census Bureau) collects and stores the confidential data and is responsible for the release of tabulations (queries) based on those data, applying appropriate confidentiality protection techniques on the tabulations. This is in comparison with the untrusted curator model, or local DP model, in which confidentiality protection techniques are applied to individual data records as they are collected and thus the curator never observes the confidential data directly, only through the output of the local DP mechanism. We point out the differences between these two models because the DP methods associated with them are necessarily different. An early example of a privacy-protection method in the untrusted curator model is randomized response (Warner, 1965) in which the respondent answers a question truthfully with some predetermined probability, but otherwise supplies a fixed answer (or answers a different question). The advantage of the trusted curator model is that it allows for DP methods that are more statistically efficient, meaning that the results will be more accurate given the same privacy-loss parameter.
In designing disclosure avoidance methods for the 2020 Census, several practical and policy considerations played a large role. Decennial census products are generally tables with counts of individuals, households, group quarters residents, or housing units with certain characteristics. The tabulation systems in place for the 2020 Census assume that the input is an individual or housing-unit record-level data set in which each row represents a person or a particular housing unit. This type of data set is often called microdata. In order to align with the tabulation system, the design specifications for TDA mandated the production of microdata—a record-level image for each person (or housing unit) that, when tabulated, produces the protected query answers (not their confidential counterparts).
Producing privacy-protected microdata is not a common requirement for disclosure avoidance systems implemented using the centralized DP model, which are often designed to produce an unbiased noisy measurement of the answer to the original confidential query according to the use case for the queries. An important implication of the microdata requirement is that it forces the final output of TDA to be nonnegative and integral because there is no accommodation for a negative weight or fractional person in the Census Bureau’s microdata tabulation systems.5 From a practical point of view, this makes complete sense. Some might find it strange if the Census Bureau reported that there were
There are two ways that the microdata requirement could be met by the disclosure avoidance system. The differentially private mechanism could directly output data consistent with the microdata requirement. Alternatively, the output of the differentially private mechanism could be postprocessed to conform to the microdata requirement. The first approach could theoretically be achieved by using the exponential mechanism (McSherry & Talwar, 2007); however, that would require the mechanism to use the fully specified output range schema because the exponential mechanism relies on sampling from that range. For our application, that approach was infeasible because it required enumerating the discrete elements of the output space that simultaneously met the microdata, invariant, edit constraint, and structural zero requirements. Not only would this have required a new theoretical apparatus to solve, but during the course of the development of the DAS the requirements changed several times. Instead, the DAS uses postprocessing to impose these requirements hierarchically using a preexisting framework from the operations research literature, specifically total unimodularity (Nemhauser & Wolsey, 1999, Sections III.1 and III.2).
There are some noticeable downstream consequences of imposing nonnegativity. If the confidential count of characteristic
A second, but related, requirement for the 2020 Census implementation was internal consistency, which is a natural property of tabulating all statistics from the same record-level input data with constant, nonnegative weights (all unity for the 2020 Census). Internal consistency means that a count of interest will be exactly the same if it appears in multiple tables or can be calculated from taking the sum or difference of counts from other tables. Consistency is a consequence of the algorithm producing privacy-protected microdata, but it is possible to have consistency without requiring tabulation from microdata.
A third requirement for the TopDown Algorithm was the implementation of invariants, edit constraints, and structural zeros. By invariants we mean counts that deliberately bypass the DP mechanism and are reported with certainty exactly as they are tabulated from the confidential data. The most notable invariants for the 2020 Census are the state total resident populations because of their essential function in the apportionment of the House of Representatives. Edit constraints and structural zeros are rules determined by Census Bureau subject matter experts, then applied to the confidential data as part of the processing that produces the Census Edited File, one of the two inputs to TDA. An example of an edit constraint is the requirement that a mother be at least some number of years older than her oldest natural child.
The privacy loss associated with the invariants is not quantifiable in the standard DP literature, and therefore the inclusion of invariants in the DAS complicates the interpretation of the privacy guarantees. Instead of going into detail here, we save this topic for a separate paper and note that the privacy-loss accounting for the DAS presented in this article ignores any contributions from the invariants.
One of the most important and challenging design considerations for the 2020 Census DAS is taking into account the multidimensional nature of the utility of the census data. While apportionment is perhaps the most important use case, census data are used for hundreds if not thousands of public, private, and commercial applications ranging from redistricting, to population age-distribution estimation, to characteristics of the housing stock. Therefore, the TDA must produce output that satisfies the same edit constraints, where feasible, with controllable accuracy, in the sense of data utility, for all of its use cases. From an economic point of view, algorithmic design and implementation for TDA can be seen as allocating a scarce resource (the finite information content of the confidential data) between competing use cases (different accuracy measures) and respondent privacy loss (Abowd & Schmutte, 2019).
Two excellent examples of competing use cases are redistricting and intercensal population estimates. Redistricting consists of drawing contiguous geographic entities that are mutually exclusive, exhaustive, and divide a political entity into voting districts of approximately equal populations. These districts also must satisfy the antidiscrimination provisions in the 1965 Voting Rights Act (42 U.S. Code § 1973). To satisfy these requirements, nonpartisan redistricting experts work with the Census Redistricting and Voting Rights Data Office (part of the Census Bureau) to designate the smallest unit of geography—the atom from which the districts are assembled—and the taxonomy for race and ethnicity categories, which must comply with Office of Management and Budget Statistical Policy Directive 15 (Office of Management and Budget, 1997). Population estimates require annually updated populations for approximately 85,000 politically and statistically defined geographic areas and age by sex by race by ethnicity tables for all counties in the United States and municipios in Puerto Rico. Accuracy for redistricting targets geographic areas that are not yet defined when the disclosure avoidance is applied for statistics on population in 252 categories (voting age by race by ethnicity). Accuracy for population estimates targets one statistic (total population) in predefined geographies ranging in size from a few hundred to millions of people and about 2,000 statistics (sex by age in single years by major race and ethnicity categories) in counties and municipios. Accuracy for redistricting is improved primarily by allocating privacy loss to the lowest levels of geography (block groups and blocks in census terminology) and the most detailed race and ethnicity summaries. Accuracy for population estimates is improved by allocating privacy loss to more aggregated geographies (counties and tracts) and favoring detail on age over detail on race and ethnicity.
The use of DP mechanisms for the 2020 Census DAS will not be the first time uncertainty is deliberately introduced into the tabular summaries from a decennial census to enhance disclosure avoidance. And in any census, disclosure avoidance uncertainty is not the only source of uncertainty affecting the published statistics. Historically, the decennial census has used data suppression, blank and replace, swapping, and partial synthetic data for disclosure avoidance in the tabular summaries (McKenna, 2018). In addition, sampling and coarsening along with swapping and partially synthetic data were used for public-use microdata samples (McKenna, 2019). These methods also create uncertainty in inferences using the published data. Suppression creates a nonignorable missing data problem because the suppression rule depends on the size of the statistic being suppressed and on the correlation of the complementary suppressions with primary suppressions. Swapping and partially synthetic data affect inferences because of the randomness used by both procedures (Abowd & Schmutte, 2015). However, the Census Bureau, like most national statistical agencies, does not provide any quantification of the uncertainty due to suppression, swapping, or partial synthesis. Consequently, the effect on inferences from the published data is usually not acknowledged because researchers are unable to quantify it. For example, in the case of traditional swapping, the parameters of the swapping algorithm are secret. In the case of disclosure avoidance systems implementing DP mechanisms, the parameters and distributions of the mechanism’s randomness are public and therefore can be quantified. TDA is an inference-valid disclosure avoidance system. Its uncertainty can be quantified and that quantification can be used to guide inferences from the published data; though, this can often be nontrivial.
The census-taking process itself can be viewed as a random event. Modern incomplete data inference has formalized the many reasons why a realized census does not produce perfect, complete data. Post-enumeration surveys have shown the undercoverage of persons with some characteristics and the overcoverage of persons with other characteristics (Hogan et al., 2013). Additionally, observed data items can be missing or erroneous for some individuals. Given the observed data, missing data techniques are used to produce the Census Edited File (CEF)—the final version of the confidential census data. Statistically, the CEF is the completed census data from a collection of single-imputation missing data models.6 In the CEF, there is a record for every housing unit, occupied group quarters resident, and housing unit resident person alive on April 1, 2020, with all tabulation variables containing valid responses on every record. There is an enormous investment in the technology of specifying this ancillary estimator for the completed census data, given the observed census data, even when there is no adjustment for differential net undercoverage. Finally, disclosure avoidance techniques are applied to these completed census data to produce tabular summaries that protect confidentiality. In this article, we use the CEF as the standard for estimating the accuracy of the DAS output. For better understanding the uncertainty introduced by disclosure avoidance, we compare the DAS output to small-scale simulations that quantify the uncertainty from census operational, measurement, and coverage errors for the 2010 Census (Bell & Schafer, 2021).
The Redistricting Data (P.L. 94-171) Summary Files are an essential data product as they are the basis for state redistricting, official population counts, and other statutory uses. The tables are therefore sometimes called the ‘redistricting data’ tables. Before going into additional details, it is important to have a sense of the geographic hierarchy that the Census Bureau uses in the creation of data products.7 The primitive element of the tabulation geographic hierarchy is the census block. All other geographic entities (states, tribal census areas, counties, places, etc.) can be defined as aggregations of blocks. The ‘spine’ of the tabulation geographic hierarchy consists of the United States, states (including the District of Columbia), counties, tracts, block groups, and blocks.8 In each case, the entities at lower levels of geography on the spine aggregate without partitions to entities above them. That is, on the tabulation geographic spine a block belongs in one, and only one, block group; a block group belongs in one, and only one, tract; and so on up the spine. This tabulation geographic spine plays an important role in the TopDown Algorithm. Other geographic aggregations (e.g., congressional districts, ZIP Code tabulation areas, incorporated places, etc.) are considered ‘off-spine’ because they don’t fit directly onto the tabulation spine’s hierarchy.
The redistricting data tables for 2020 (U.S. Census Bureau, 2020a) are
P1: Race
P2: Hispanic or Latino, and not Hispanic or Latino by race
P3: Race for the population 18 years and over
P4: Hispanic or Latino, and not Hispanic or Latino by race for the population 18 years and over
P5: Group Quarters (GQ) population by major GQ type
H1: Occupancy status.
Each table is produced down to the census block level. In order to tabulate from microdata, it is necessary to consider only a subset of census attributes with mutually exclusive levels. We designate these as the detailed queries for the person and housing-unit data, respectively. For person data the following attributes and levels constitute the detailed query to create the redistricting data tables:
Block. 5,892,698 levels9
Household or Group Quarters Type. 8 levels: household, correctional facilities for adults, juvenile facilities, nursing facilities/skilled-nursing facilities, other institutional facilities, college/university student housing, military quarters, other noninstitutional facilities;
Voting Age. 2 levels: age 17 or younger on April 1, 2020, age 18 or older on April 1, 2020;
Hispanic/Latino origin. 2 levels: Hispanic/Latino, not Hispanic/Latino;
Census race. 63 levels: every combination of Black/African American, American Indian/Native Alaskan, Asian, Native Hawaiian/Pacific Islander, White, and some other race, except “none of the above.”
Excluding block, there are 2,016 mutually exclusive, exhaustive combinations of these attribute levels. Including block, there are 11,879,679,168 levels, of which the number of persons in each must be output by the DP mechanisms so that TDA can be used to tabulate each of the redistricting data person-level tables.
For housing-unit data the following attributes and levels constitute the detailed query to create the redistricting data tables:
Block. 5,892,698 levels;
Occupancy status. 2 Levels: occupied, vacant.
Therefore a total of 11,785,396 mutually exclusive, exhaustive combinations of the attribute levels must be output by the DP mechanisms so that TDA can be used to tabulate each of the redistricting data housing-unit tables.
In addition to the measurements associated with the specified tables, the TDA is programmed to bypass the DP mechanisms when instructed by policymakers at the Census Bureau. Queries that bypass DP mechanisms are called invariants. Mathematically, an invariant is equivalent to designating infinite privacy loss for the specified tabulation, putting such queries outside the formal guarantees of differential privacy. In addition, invariants challenge the implementation of TDA because they may force the postprocessing phase of TDA to attempt to solve computationally infeasible problems (see Section 7). As a consequence, the Census Bureau policymakers have specified invariants only when there was either a constitutional mandate or an operational constraint in the conduct of the census.
The constitutional constraint comes from the mandate to apportion the House of Representatives based on the “actual Enumeration” of the population on census day. Technically, this constraint only requires the allocation of the seats in the House to be invariant—not subject to changes due to the confidentiality protection system; however, DSEP determined that setting the populations of the 50 states, District of Columbia, and Commonwealth of Puerto Rico invariant was the most transparent and practical way to accomplish this objective. The policymakers accepted the risk associated with releasing these data outside the enhanced protections provided by differential privacy; that is, aggregation is the only disclosure avoidance applied to the state populations.
The operational constraints come from the methods used to keep the Census Bureau’s Master Address File (MAF)—the frame for the decennial census and many other surveys—up to date. The MAF is a database of known addresses and features or characteristics of these addresses. The most important feature is living quarters, which means that the address designates a domicile capable of housing at least one human being. Fixed domiciles come in two types: housing units, which usually domicile a single household, and group quarters, which usually domicile a group of unrelated individuals.10 There are also living quarters that are not fixed domiciles such as service-based facilities (soup kitchens, shelters, etc.) and transient locations (tents, campgrounds, etc.). In addition, there are addresses for facilities that are not deemed living quarters at all, such as businesses, highway medians, and fire hydrants.
Maintaining the MAF is a continuous process involving many operations. A few are salient to the disclosure avoidance system. First, by statute, the Census Bureau conducts the Local Update of Census Addresses (LUCA) late in the census decade. The LUCA operation enlists partners at all levels of government—states, recognized tribes, counties, and municipalities—to review the addresses in the MAF. In support of this operation, the Census Bureau publicly distributes certain summaries of the MAF, among these public tabulations are the number of addresses in each block that represent housing units and the number that represent group quarters of different types. While the addresses associated with these housing units and group quarters remain confidential, the counts are used to guide local partners to areas they believe may be incomplete. In these areas, the local partners supply candidate addresses, which are used to update the MAF. Second, throughout the decade, the Census Bureau cooperates with local expert demographers to update the location and types of group quarters facilities in an area. Since group quarters are in scope for the American Community Survey, this cooperation also affects that survey. Critically, the local demographers use public information on facilities like prisons, nursing homes, dormitories, and barracks to facilitate timely updates of the MAF group quarters data. For operational reasons, then, the Census Bureau treats the count of physical addresses on a block, their designation as housing units or group quarters, and the type of group quarters facility as public information. Note that while unoccupied (vacant) housing units are enumerated in the census, vacant group quarters are not. Hence, TDA treats housing units and occupied group quarters as invariant. The Census Bureau policymakers interpret this decision as requiring the disclosure avoidance to be consistent with public information. This decision means that a prison, or other group quarters, will be properly labeled in census tabulations if it is properly labeled in the confidential data, but the number and characteristics of the population living in those facilities or in housing units is protected by the DP mechanisms.
The complete list of invariants implemented in TDA is:
total population for each state, the District of Columbia, and Puerto Rico;
the count of housing units in each block, but not the population living in these housing units;
the count and type of occupied group quarters in each block, but not the population living in these group quarters.
Edit constraints and structural zeros occur when a particular combination of values of the characteristics in the confidential data is deemed impossible a priori; that is, before any data are collected. For example, if the relationship of person two to person one (also called the householder) is “natural child,” then person two cannot have age greater than person one’s age. Edit constraints and structural zeros modify the sample space of the final output data. In the case of the DAS, this would be the postprocessed results (MDF) but not the intermediate noisy measurements.
Edit constraints are conceptually different from invariants; however, when examining the full constraint set in TDA, they are mathematically indistinguishable. Nevertheless, it is important to maintain the distinction. Edit constraints and structural zeros are imposed on the confidential data during the process of creating the Census Edited File. They are never violated in the input data to TDA by design; therefore, they must also be respected in the output data. Invariants, on the other hand, are statistics computed from the input data that deliberately skip the differential privacy mechanism. They are passed to the postprocessing without noise injection. Hence, the distinction is that edit constraints and structural zeros define the sample space for the census. They are features of its design, not part of the information collected from respondents. Edit constraints and structural zeros are public information known before data collection, while invariants are statistics collected in the census data that bypass the confidentiality protection system by policy.11
The complete edit specification for the 2020 Census Edited File is not a public document, nor is the complete edit specification for the 2010 Census Edited File. The version of TDA that processed the redistricting data product contains only a few edit constraints and structural zeros. The most important of these are listed here:
for the race variable, the combination “none of the above” is a structural zero;
for occupancy status, there are no unoccupied group quarters.
The race constraint is derived from OMB Statistical Policy Directive 15 (Office of Management and Budget, 1997), which requires the use of the categories listed in the redistricting data schema except for “some other race,” which was congressionally mandated (Consolidated Appropriations Act, 2010). Because a respondent cannot select “none of the above,” there are
The structural zero on unoccupied group quarters, the occupied group quarters invariant, and the housing unit invariant interact in creating the TDA constraint set. If there is an occupied group quarters facility of a particular type in the relevant geographic area, then at least one person must be assigned in postprocessing to that facility. If there is a housing unit in the particular geography, however, there is no similar requirement because occupancy status is not invariant for housing units. However, the number of householders (person one on the questionnaire) cannot be greater than the number of housing units. Aside from this, TDA treats the data for persons and housing units separately. Hence, occupancy status is protected without reference to the person data. This interaction is deliberate. Joining the housing unit data to the person data and then applying a differentially private mechanism to the joined data is a much harder problem because the sensitivity of queries (see Definition 3) about the composition of a household, including how many persons live in the housing unit, is the maximum allowable household size, not one. Join queries like this will be processed by DP mechanisms implemented in algorithms that do not postprocess to microdata. These data products will be released at a later date as part of the Detailed Demographic and Housing Characteristics (Detailed-DHC) data release. Users may be tempted to divide the total household population in a particular geography from table P1 by the total number of occupied housing units in that geography from table H1. For well-populated geographies this answer will be useful; however, for sparsely populated geographies it will not be reliable. The join query DP mechanisms implemented in the Detailed-DHC are designed to produce reliable data for average household size.
The TDA and the block-by-block algorithm, which we describe in Appendix A, rely on DP mechanisms responsible for generating the noisy measurements. For the TDA, we use the discrete Gaussian mechanism, based on the discrete Gaussian distribution (Definition 4). The multivariate discrete Gaussian mechanism is described in Theorem 1. Setting
The discrete Gaussian mechanism was not always the DP mechanism in the TDA. In earlier work we used the geometric mechanism, which, like the discrete Gaussian mechanism, produces noise with integer values but uses the double-geometric rather than the discrete Gaussian distribution. The discrete Gaussian mechanism was ultimately chosen based on empirical tests of the accuracy of the TDA using each mechanism with comparable privacy-loss budgets. The discrete Gaussian distribution has smaller tail probabilities than the double-geometric distribution, and this provably reduces the worst-case errors associated with the microdata requirement (Abowd, Ashmead, Cumings-Menon, Garfinkel, et al., 2021).
The census-taking process results in a person-level data set of
Geographic levels play an important role in the methodology. It is useful to consider the cardinality of the sample space disregarding geography. Let the sample space without geography be represented by
Instead of representing the data as an
A set of linear queries on
A key feature of census data and tabulation products is geography. We use the symbol
Let
The TopDown Algorithm processes the noisy measurements from the DP mechanism and the invariants into an estimate of the contingency table vector. We use
Depending on the specifics of the invariants and edit constraints, both equality and inequality constraints may be imposed within TDA. Specifically, in the redistricting data use case, inequality constraints are due to invariants imposed on the counts of housing units and occupied group quarters because these put bounds on the number of people that must be in that geographic unit. For example, for the invariant on the number of occupied group quarters facilities of a certain type by block, there must be at least one person per group quarters type in that block because vacant group quarters facilities are out-of-scope for the census. We denote the constraints on the algorithm output as a result of the invariants as a set of equality and inequality constraints:
where
The census ultimately produces a set of tabulations that can be characterized as a query matrix
The utility of the algorithms developed for the redistricting data was first and foremost designed to meet stringent fitness-for-use accuracy targets for the redistricting and Voting Rights Act use cases. This is a challenging task because the geographic units of interest are the voting districts that will be drawn after the data are published. Hence, the target geographies cannot be specified in advance. Fortunately, these voting districts divide political entities like states, counties, incorporated places, school districts, and tribal areas whose geographies are prespecified. Therefore, error metrics that optimize accuracy within these predefined political entities help properly control the metrics for their equal-population future voting districts. The collections of blocks in the voting districts are not arbitrary. They cover the political entity and form aggregates of approximately equal total population. When the political entity has a low population, there are usually only a few voting districts or none at all. While a complete description of the criteria for forming legislative districts is beyond the scope of this article, more details about the compactness, contiguity, political boundary, and community of interest requirements are provided by the nonpartisan organization that represents states’ interests in the development of redistricting data, the National Conference of State Legislatures (2021).
We performed explicit experimental parameter tuning using 2010 Census data to ensure that the largest racial/ethnic group (as defined in Statistical Policy Directive 15 [Office of Management and Budget, 1997]) in off-spine geographies (e.g., minor civil divisions, census places, American Indian Reservations, etc.) with populations of at least
Independently, we also evaluated the performance of TDA by examining the absolute and relative error of many statistics by quantiles of the count in the 2010 CEF. These error metrics were used for total population, voting-age population, number of races, population in each OMB race category, and population in each ethnicity category within OMB race category. These quantile statistics were used primarily to determine the interactions of changes in the privacy-loss budget allocation among the queries, which cannot be done ex ante because, while the errors from the DP mechanism are independent of the data, the errors in postprocessing are not.
Our accuracy assessments included many metrics (U.S. Census Bureau, 2020b, 2021c) using the 2010 Census data that were developed by demographers at the Census Bureau or suggested by external users. They included many different error measures such as mean absolute error, mean absolute percent error, percent difference thresholds, outliers, as well as different characteristics of interest relevant to redistricting (total populations, total population 18 years and over, occupancy rates, Hispanic or Latino origin population, etc.) across many different geographic levels.
The existing policy and scientific literature provide very little guidance on managing a privacy-loss budget to trade-off accuracy on multiple dimensions. When tuning the TDA, our experiments were designed to find the smallest privacy-loss budget that met specified accuracy goals. In particular, we used a tight standard for redistricting data accuracy in the smallest voting districts that the Department of Justice Voting Section provided as examples of Section 2 scrutiny under that law. Finding the minimum privacy-loss budget that could achieve this goal illustrated where other accuracy objectives deteriorate—in particular, statistics at the tract and county level, which were down-weighted by the redistricting accuracy measure. Other subject matter experts, internal and external, then laid out their accuracy goals for these other statistics. Internal statistical disclosure limitation experts, including some of the authors of this article, demonstrated that these accuracy goals—redistricting and general demographic—could be achieved with a modest privacy-loss budget at the block level. The iterative experimental tuning process is described in more detail in Section 8.
The statistical optimization problem can be summarized as: given accuracy targets based on
that meets the accuracy targets such that the total privacy-loss budget
choosing which query matrices to use for differentially private noisy measurements;
choosing the accuracy of each of the noisy measurements; and
optimally combining the noisy measurement information to produce a nonnegative integer estimate of
While these algorithm elements should work hand-in-hand, we note that there could be many variations of (3) using the same measurements that are a product of elements (1) and (2) that produce results with good utility. Not only are there many objective functions that can be used for optimization, but many variations of methods to produce nonnegative integers satisfying a set of constraints. Over the course of this work, we experimented with each of these elements in order to build the best algorithm for key redistricting and demographic use-cases.
While it would be possible to ask a linear query
We use the term marginal query group to refer to a matrix with rows given by linear queries such that the product with the data vector,
The individual marginal query groups we consider have two important properties. They are exhaustive in terms of the levels (we do not query 18 and older without also querying 17 and younger) and the levels are mutually exclusive. Along with the use of binary elements, these properties ensure that the
The motivation for the TDA is to leverage the tree-like hierarchical structure of census geographies in order to produce estimates of the contingency table vector that are more precise than algorithms focused on directly and independently estimating block-level vectors then aggregating16 in a scalable manner. The TDA can be thought of as two phases, a measurements phase and an estimation phase.17 Table 1 presents a schematic summary of the algorithm.
First, in the measurement phase, sets of differentially private noisy measurements are taken about key queries at each of the geographic levels in the hierarchy. Because queries at different geographic levels (say block and tract) will involve the same individuals, the total privacy loss will be the summation of the privacy loss at the individual geographic levels. Therefore, the distribution of the total privacy-loss budget across levels of the geography as well as the queries within a geographic level must be chosen with care. In the estimation phase, the goal is to produce a nonnegative integer estimate of the contingency table vector from the differentially private noisy measurements.
As the name implies, estimation is carried out sequentially, starting with the root geography (United States). The algorithm estimates the nonnegative integer U.S.-specific contingency table vector from the set of U.S.-level differentially private queries on the data, maintaining any equality or inequality constraints implied by the invariants or due to edit constraints and structural zeros. Due to the complexity of the problems, this is done in a two-step manner. First, we estimate a constrained nonnegative weighted least squares solution from the U.S.-level differentially private queries that respects all constraints implied by the invariants. Second, we perform a controlled rounding that finds a nearby nonnegative solution that also respects the constraints. See Section 7.1 for more detail. Once the solution is found, the U.S.-level estimated contingency table vector is considered fixed and used as an additional constraint on the next subproblem in the sequence.
The algorithm then estimates the state-level contingency table vectors, enforcing an aggregation constraint to the U.S.-level contingency table vector and using the state-level differentially private measurements on the data. This is done using a similar two-step process: first, solving a weighted nonnegative least squares problem; then, solving a controlled rounding problem. However, in addition to any constraints that are the product of invariants or edit specifications, an aggregation constraint is added to both subproblems to ensure that the estimated state-level contingency table vectors sum to the U.S.-level estimated contingency table vector.
Fixing the estimated state-level contingency table vectors, the estimation steps are repeated within each state in order to estimate each of the county-level contingency table vectors, enforcing aggregation constraints for each state. The algorithm then proceeds iteratively to additional geographic levels (tract, block group, and block) until the contingency table vector is estimated for each block. In each step, the differentially private measurements taken at that level are used along with constraints due to invariants, edit specifications and aggregation.
In an ideal scenario, the contingency table vector
In this section, we present the estimators for the least squares and rounder optimization problems, respectively, that we initially developed. Then, we provide the details of enhancements to these basic estimators using more complex estimators that find a solution by performing a series of optimizations and constraining passes. We detail these more complex estimators in Sections 7.2–7.3.
The basic least squares estimator. There are two versions of the least squares estimation problem depending upon whether the subproblem includes a single node—as in Step (1)(a) of the estimation phase of TDA—or multiple nodes—as in Step (2)(a) of the estimation phase of TDA. Let
Now, let
In (7.1) and (7.2) we use generic notation for
The basic rounder estimator. Similar to the least squares optimization problem, we specify two versions of the solution depending on whether we are estimating over a single or multiple nodes. Let
Then the rounder estimator for the joint solution for the children of
DAS implements the rounder problem by subtracting the floor and minimizing the distance over the fractional part of the target vector. This problem is feasible and efficiently solvable if it has the total unimodularity (TUM) property (see Section 7.2).
While we found that the basic least squares estimator worked well overall, we observed a concerning trend in the bias of the estimation of geographic totals. Even when these queries were asked with very high precision, when combined in an optimization with a query with a large number of components (e.g., the detailed query), the resulting estimated data vector would be less accurate on the target query than would be expected. We know that this problem is due to the constraint that all cell counts must be nonnegative. We now understand that there is an inherent uncertainty principle limiting the accuracy of inequality-constrained least squares. The maximum expected error in either the detailed query or its total depends upon both the privacy-loss budget allocation and the total number of queries (Abowd, Ashmead, Cumings-Menon, Garfinkel, et al., 2021).18 Managing this accuracy tradeoff between detailed and total queries, which is due entirely to the postprocessing requirement to produce privacy-protected microdata, motivated the algorithmic improvements discussed here.
To improve on the single-pass nonlinear least squares estimator, we developed a multi-pass estimator. Instead of attempting to find a data vector solution that minimizes errors for all noisy queries at once, we use multiple passes of the optimizer to estimate increasingly more detailed tabulations of the data vector. On each pass, after estimating one or more margins of the contingency-table vector, we constrain the following passes to find a solution that maintains the previously estimated margins. For example, for the redistricting data persons schema, we might first estimate and constrain the total population counts in a first pass. Then, estimate the detailed query in a second pass subject to the constraint that its margin equals the total population count from the first pass. Multi-pass processing, therefore, controls the algorithmic error due to the nonnegativity constraints by prioritizing the queries based on the expected size of the total query (total population before voting-age population, etc.).
The input of the least squares multi-pass methods described here is a collection of marginal queries and, for each such query group, the vector of DP random measurements for the query group. Let
The parameter
Note that the basic least squares estimator can be recovered by choosing
The basic rounder estimator finds a nearby integer solution by minimizing the distance from the real-valued solution with respect to the individual cell values while still respecting constraints. We found that, as the size of the optimization problems grew, even though in the rounder solution individual cells were forced to be close to their real-valued counterparts, the summation of many cells (e.g., a variable margin) could still change by a large amount. Changes to the individual detailed cell values could add up to make a large change in, for example, the voting-age population. To improve this behavior, we developed an estimation routine that optimizes based on the distance from the real-valued solutions to the queries used in the least squares estimator. Like the least squares multi-pass estimator, we also allowed for multiple passes of the rounder at increasing levels of detail.
Because the nonnegative least squares estimator is solved numerically, the amount of mass (accumulated fractional parts of the least squares solution) the rounder was responsible for estimating increased as privacy-loss budgets increased. That is, as we increased the privacy-loss budget, we wanted to ensure monotone convergence to perfect accuracy. However, when using the original, basic single-pass rounder, as the privacy-loss budget increased many more cell estimates from the least squares problem were 0.999 (not 1.0). As this happened, because the rounder is responsible for estimating a total amount of mass that it is equal to the sum of the fractional parts of the least squares estimator, the rounder became increasingly important to solution quality the larger the privacy-loss budget. Rounding using only the detailed query answers is clearly an inferior estimator of most queries of interest compared to using all queries from the least squares solution. Hence, basic rounding caused very troubling behavior: specifically, an intermediate region in which increased privacy-loss budget resulted in poorer performance on utility metrics. When we implemented multi-pass rounding, we also added the ability to specify additional queries as part of the rounder’s objective function. TUM implies restrictions on these extra queries. Without violating TUM, we inserted a single additional set of hierarchically related queries (a ‘tree of queries’) in this way. This modification of the objective function solved the nonmonotonicity problem.
Let
We note that the existence of a solution for the multi-pass rounder is not always guaranteed and depends heavily on the choice of the query groups used in each pass and their ordering. In particular, to guarantee polynomial time solution of the rounder estimator, the mathematical constraints must exhibit TUM, the condition that every square nonsingular submatrix is unimodular (integer and has determinant +1 or -1). TUM does two things. First, if a continuous solution exists to the nonnegative least squares problem problem, then TUM implies that an integer solution exists to the rounder problem. Second, TUM implies that finding the rounder solution is tractable. TUM does not, unfortunately, guarantee that the continuous nonnegative least squares problem has a solution in the first place. Because the mathematical constraints in the rounder estimator are composed of interactions between the invariants, edit constraints, and hierarchical consistency constraints, ensuring TUM requires limiting invariants and edit constraints as much as possible.
American Indian and Alaska Native (AIAN) areas tend to be far off the standard tabulation spine and, therefore, are at risk of poor accuracy for a given privacy-loss budget. For example, the largest AIAN tribal area is the Navajo nation, which includes subsets of many counties within the states of New Mexico, Arizona, and Utah. To improve the accuracy of the AIAN areas, we define what we call the AIAN spine, which subdivides each state with any AIAN tribal areas into two geographic entities: the union of the legally defined AIAN areas within the respective state and the remaining portion of the state.20 Counties, tracts, and block groups are also subdivided as necessary to be fully nested within either the AIAN branch or the non-AIAN branch of the state’s hierarchy. Note that AIAN areas, like all tabulation geographic entities, are composed of tabulation blocks; therefore the creation of this spine did not require the census blocks to be subdivided.
In the 2010 Census geography definitions, there were 36 states with one or more AIAN area on the spine. That number increased to 37 in 2020. Functionally, these new geographic units are treated as state equivalents in the TDA. Meaning there are 87 (88 in 2020) state-equivalent geographies within the United States used in TDA.
Creating the AIAN spine created a subtle, but important, modification to the way the least squares and rounder problems were solved. In states with AIAN areas on the spine, the TDA hierarchy branches at the state level. This means that at the state level, the AIAN total population and the non-AIAN total population are solved with privacy-loss budget. Only their sum is invariant. For states without AIAN areas on the spine total population is invariant and there is no need to estimate AIAN and non-AIAN components.
Off-spine entities are those geographic entities that are not directly estimated as part of the TDA spine. Using the standard census tabulation spine (United States, state, county, tract, block group, block), many legally defined off-spine entities are far away from the tabulation spine. The relevant distance measure is the minimum number of on-spine entities added to or subtracted from one another in order to define the off-spine entity, which we call the off-spine distance. There is an inverse relation between off-spine distance and count accuracy for off-spine entities. As an entity’s off-spine distance increases, the accuracy of its estimated characteristics decreases due to the accumulation of independent random noise from the DP mechanism applied to each of these components. Our approach to optimizing the geographic spine is discussed in detail in Abowd, Ashmead, Cumings-Menon, Kifer, et al. (2021). In this section, we give the provide the main details.
In order to improve the accuracy of legally defined off-spine entities, we designed an algorithm to reconfigure the standard tabulation spine for use within the TDA. This was done in two ways. First, we created custom block groups, which are regroupings of the tabulation blocks such that off-spine entities are closer to the spine and group quarters facilities of each type are isolated from their surrounding housing-unit only areas. Isolating group quarters in this manner improves the accuracy of estimated characteristics in both the group quarters and surrounding housing-unit only blocks. Second, we bypass parent geographies with only a single child, combining the privacy-loss budget for the child with that of the parent, which avoids wasting privacy-loss budget on redundant measurements and yields more accurate estimates. The accuracy gain occurs because the algorithm proceeds in a top-down fashion; hence, in the case of a single child geography all measurements must exactly equal those of the parent. Additional privacy-loss budget used on the child after the parental estimation is wasted. This optimization step effectively collapses the parent and child into a single node when the parent’s fan-out is one and uses the combined privacy-loss budget of the parent and child to take DP measurements once for this node.
We applied spine optimization after defining the AIAN spine. Hence, the AIAN areas and the balance of the state were optimized in separate branches of the hierarchy. To implement spine optimization, the Geography Division provided a list of the major sub-county governmental entities within each state. For portions of strong Minor Civil Division (MCD) states outside of AIAN areas, these entities were MCDs. In all other areas, the major sub-county entities were a combination of incorporated and census-designated places. As with states, counties, and tracts using the AIAN spine, these substate entities were divided into the part within an AIAN area and the balance of the substate entity. Call these substate entities the targeted off-spine entities. We minimized the off-spine distance for the targeted off-spine entities using a greedy algorithm that created custom block groups to reduce the number of on-spine entities required to build the targeted off-spine entities. The custom block groups replaced tabulation block groups within TDA; however, the official tabulation block groups are used for all reported statistics. That is, custom block groups are an implementation detail for TDA, not a replacement for tabulation block groups. The spine optimization algorithm could also have created custom tract groups, but this feature was not implemented for the redistricting data.
Table 2 shows selected statistics for geographic entities at each level of the hierarchy after applying the spine optimization to the AIAN spine respectively for 2010 and 2020. The number of entities for state, county, tract and block group differ from the number of tabulation tabulation geographic entities because the AIAN spine subdivides geographies with AIAN areas. Effectively, this moves tabulation counties, tracts, and block groups slightly off spine in order to move AIAN areas, MCDs, and census places closer to the spine.
2010 | 2020 | |
---|---|---|
States with AIAN areas on the Spine (FIPS Codes) | 01, 02, 04, 06, 08, 09, 12, | 01, 02, 04, 06, 08, 09, 12, |
Entities State Level | 87 | 88 |
Entities County Level | 3,488 | 3,496 |
Entities Tract Level | 73,180 | 84,589 |
Tabulation Block Groups | 217,740 | 239,780 |
Custom Block Groups** | 402,020 | 409,548 |
Tabulation Blocks | 11,078,297 | 8,132,968 |
Tabulation Blocks with Occupied GQs | 112,956 | 126,723 |
Tabulation Blocks with | 6,379,963 | 5,879,049 |
Tabulation Blocks with Potential Positive Population | 6,398,202 | 5,892,698 |
*Rhode Island has an American Indian and Alaska Native (AIAN) area in 2020; however, it is not included in the list of states with AIAN areas on the spine because there are no housing units or occupied Group Quarters (GQs) in this AIAN area. Therefore, the blocks within this AIAN area are not included in the spine at all because they have a zero probability, a priori, of containing positive resident population.
**Custom block groups (CBG) used within the TopDown Algorithm (TDA) differ from tabulation block groups. These differences improve accuracy for off-spine geographies like places and minor civil divisions. The use of CBG for measurement and postprocessing within TDA does not impact how the resulting data are tabulated. All 2020 Census data products will be tabulated using the official tabulation block groups as defined by the Census Bureau’s Geography Division.
The TDA implementation within the DAS consists of two primary parts: an underlying framework and the specialization of the framework for the implementation in 2020 Census.
The DAS framework is a set of 22 Python source files (2,814 lines of code and 1,125 lines of unit tests) that provides a generic framework for implementing a batch data processing system. The framework provides for explicit modules that read a configuration file and then use the information in that file to initialize the remainder of the application, read the input data (the reader), perform the differential privacy computation (the engine), validate the results of the computation (the validator), and write out the results (the writer). The actual reader, engine, validator, and writer are implemented by their own Python classes that are also specified in the configuration file. Each module can in turn read additional parameters from the configuration file. Thus, the behavior of the entire system is determined by the configuration file, making it easy to use the framework for both algorithmic experimentation as well as production by specifying different configuration files.
The TDA for the 2020 Census is written within the DAS Framework. It consists of 461 Python source files (78,428 lines of program files and 11,513 lines of unit tests). The system includes 10 readers, 20 writers, and multiple engines. Only one reader, writer, and engine are used for any given run. To document the provenance for each TDA run, the DAS framework automatically determines from the configuration file the specific Python files required and saves this information in XML and in Portable Document Format (PDF) in the DAS Certificate. The system also creates an archive containing the complete source code run. All TDA run data and metadata are archived in the AWS Simple Storage Service (S3) with vintage tags.
All DP mechanisms implemented in TDA require a source of random numbers. We estimate that the 2020 Census will require roughly 90TB of random bytes to protect the person and housing unit tables. We generate random bits in the AWS environment using the Intel RDRAND instruction mixed with bits from /dev/urandom (Garfinkel & Leclerc, 2020). Because we used an implementation of the Canonne et al. (2020, 2021) exact sampling approach for the discrete Gaussian mechanism, and allocated privacy-loss budgets using rational numbers, our implementation avoids the vulnerabilities often associated with floating-point random numbers as noted by Mironov (2012).
The 2010 Census data represent the best source of real data for conducting utility experiments relevant to the 2020 census without using the 2020 data. Starting in October 2019, the Census Bureau released a series of demonstration data products based on the 2010 data (U.S. Census Bureau, 2021d). Each set consisted of privacy-protected microdata files (PPMF) generated from the microdata detail file (MDF) that supported at least the redistricting data tables. When released, each demonstration data product represented the current state of DAS development.
It is not possible to tune a formally private disclosure avoidance system by directly using the confidential data it is designed to protect. Doing so causes the values of the tuning parameters—the privacy-loss budget allocations and the query strategies—to leak information about the confidential data. This is the reason that the TDA was not tuned using the 2020 Census Edited File. It is also the reason why quality assurance of the MDF is a difficult conceptual task. Historically, the disclosure avoidance methods used for the decennial census were tuned using data from the same census. Record swaps were rejected if the resulting swapped data did not meet particular accuracy standards. For historical reasons and because of errors associated with the microdata requirement, quality assurance for the 2020 DAS MDF does not strictly adhere to the tenets of formal privacy. The MDF receives human review examining it for errors that look unusual, surprising, or unacceptable compared to the 2020 Census Edited File. Notification of such errors may be escalated up the organizational hierarchy for review and decision-making. As noted in Section 10, there were no such errors flagged in the review of the production redistricting data; however, this situation highlights the need for formally private techniques for quality assurance.
The demonstration data product dated April 28, 2021, was the final preproduction release. This PPMF consisted of person- and housing-unit–level microdata with two global privacy-loss budgets:
Table 3 gives the proportion of the
Geographic Level | Persons | Housing Units |
---|---|---|
U.S. | 51/1024 | 1/1024 |
State | 153/1024 | 1/1024 |
County | 78/1024 | 18/1024 |
Tract | 51/1024 | 75/1024 |
Custom Block Group* | 172/1024 | 906/1024 |
Block | 519/1024 | 23/1024 |
*The custom block groups used within TDA differ from tabulation block groups.
Tables 4 and 5 show the proportions of the total
The set of queries and the values used per geographic level and per query were determined by a set of internal experiments that took place over a period of approximately five months, beginning in December 2020 and ending in early April 2021. These internal experiments estimated the minimum global privacy-loss budget and its allocation to specific queries in order to ensure fitness-for-use in the redistricting application. See Wright and Irimata (2021) for details. These allocations were reflected in the April 2021 demonstration data product. Subsequent analysis by internal and external stakeholders addressed accuracy criteria for other, primarily demographic, uses. These allocations are reflected in the final production settings.
The initial, systematic tuning phase specified a quantitative metric targeted at the redistricting use case: for at least 95% of geographic units, the largest of a selected set of CENRACE
The TDA has error distributions that are data-dependent in a fashion that cannot be expressed in closed form and depends on confidential data values that cannot be published. Therefore, we performed the fitting to the redistricting use-case iteratively rather than using the ex ante properties of zCDP. We used six candidate query strategies (queries for which formally private measurements are taken), created by defining two dimensions. First, query strategies varied by either using measurements that closely conformed to those in the publication specifications for the redistricting data product, or by collapsing/excluding certain marginal queries to try to take advantage of dependence among queries. Second, because both formal privacy and certain measures of error in TDA’s estimation are especially sensitive to cells with small counts, we assigned budgets to queries in one of three ways: proportional to the square of query cardinality, inversely proportional to the square of query cardinality, or uniformly assigned. Crossing these two dimensions generated six candidate strategies. Optimization passes were ordered heuristically, with queries assigned large relative budgets typically isolated in their own passes. In the strategies with almost all privacy-loss budget assigned to the DETAILED query (proportional to the square of query cardinality), in particular, queries were optimized simultaneously, in a single pass.
Strategies that assigned weight according to the inverse-square of query cardinality were quickly ruled out as noncompetitive. For the remaining four strategies, we iterated experimentally, geographic level by geographic level starting at the top with the United States. First, using binary search, for each geographic level, we identified the approximate assignment of
The April 2021 demonstration data product used the
Query | Per Query | |||||
---|---|---|---|---|---|---|
U.S. | State | County | Tract | CBG* | Block | |
TOTAL (1 cell) |
| |||||
CENRACE (63 cells) | ||||||
HISPANIC (2 cells) | ||||||
VOTINGAGE (2 cells) | ||||||
HHINSTLEVELS (3 cells) | ||||||
HHGQ (8 cells) | ||||||
HISPANIC × CENRACE (126 cells) | ||||||
VOTINGAGE × CENRACE (126 cells) | ||||||
VOTINGAGE × HISPANIC (4 cells) | ||||||
VOTINGAGE × HISPANIC × CENRACE (252 cells) | ||||||
HHGQ × VOTINGAGE × HISPANIC × CENRACE (2,016 cells) |
*Custom block groups (CBG) differ from tabulation block groups and are only used by the TopDown Algorithm (TDA)
**The TOTAL query (total population) is held invariant at the state level. This
Query | Per Query | |||||
---|---|---|---|---|---|---|
U.S. | State | County | Tract | CBG* | Block | |
Occupancy Status (2 cells) | 1 | 1 | 1 | 1 | 1 | 1 |
*Custom block groups (CBG) differ from tabulation block groups and are only used by the TopDown Algorithm (TDA).
For the production run of the 2020 Census Redistricting Data (P.L. 94-171) Summary File, a total privacy-loss budget of
Geographic Level | Person | Housing Units |
---|---|---|
U.S. | 104/4099 | 1/205 |
State | 1440/4099 | 1/205 |
Country | 447/4099 | 7/82 |
Tract | 687/4099 | 364/1025 |
Custom Block Group* | 1256/4099 | 1759/4100 |
Block | 165/4099 | 99/820 |
*Custom block groups differ from tabulation block groups and are only used by the TopDown Algorithm.
Query | Per Query | |||||
---|---|---|---|---|---|---|
U.S. | State | County | Tract | CBG* | Block | |
TOTAL (1 cell) | ||||||
CENRACE (63 cells) | ||||||
HISPANIC (2 cells) | ||||||
VOLTINGAGE (2 cells) | ||||||
HHINSTLEVELS (3 cells) | ||||||
HHGQ (8 cells) | ||||||
HISPANIC × CENRACE (126 cells) | ||||||
VOTINGAGE × CENRACE (126 cells) | ||||||
VOTINGAGE × HISPANIC (4 cells) | ||||||
VOTINGAGE × HISPANIC × CENRACE (252 cells) | ||||||
HHGQ × VOTINGAGE × HISPANIC × CENRACE (2,016 cells) |
*Custom block groups (CBG) differ from tabulation block groups and are only used by the TopDown Algorithm.
Query | Per Query | |||||
---|---|---|---|---|---|---|
U.S. | State | County | Tract | CBG* | Block | |
Occupancy Status (2 cells) | 1 | 1 | 1 | 1 | 1 | 1 |
*Custom block groups (CBG) differ from tabulation block groups and are only used by the TopDown Algorithm.
There are hundreds, if not thousands, of possible accuracy metrics that could be shown in order to assess the utility of the TDA for the redistricting and general demographic use cases. Here, we show just a few. The complete set of detailed statistical metrics for the redistricting use case can be found in Wright and Irimata (2021). We summarize here. First, for census-designated places and target off-spine entities, including off-spine entities in both the AIAN and non-AIAN branches of the hierarchy, the production settings of TDA achieve the
Figures 2 through 5 are good representations of the improvement in the TDA over time including those found in the tuned April 2021 release and the final production settings. The figures illustrate four different accuracy metrics; however, it is important to remember that the statistical optimization problems in Section 7 focus on
To further study the effect of certain design choices on the utility of the results, we conducted an additional series of experiments after the data release using the 2010 redistricting data. We considered the final production settings (see section 8.2 for full details) as a baseline and completed experimental runs to study the marginal effects of 1) randomness 2) increasing
Tables 9 and 10 show the absolute error across selected queries averaged across the geographic level of interest. The absolute error for a given query can be represented
When comparing the different runs in Tables 9 and 10, we see that the average errors for baseline1 and baseline2 are very consistent with each other, especially at the lower levels of geography where we are averaging over a large number of geographic units. While, ideally, average errors could be established by averaging over many runs of the TDA, a single run is extremely computationally expensive. These results show that in evaluating this type of average errors, it is not necessary to look across many runs. We note however, that errors averaged across geographies should not be used as a direct substitute for the run-to-run variability in an individual geography’s query error. This is because the underlying error distributions are not necessarily identical across geographic units. We are actively researching better methods for summarizing the error distributions for the published TDA statistics.
The effect of increasing the overall
Geographic Level | Query | baseline1 | baseline2 | inc_rho | l2_onepass |
---|---|---|---|---|---|
U.S. | TOTAL | 0.00 | 0.00 | 0.00 | 0.00 |
State | TOTAL | 0.00 | 0.00 | 0.00 | 0.00 |
County | TOTAL | 1.75 | 1.75 | 1.71 | 1.73 |
Tract | TOTAL | 1.94 | 1.93 | 1.84 | 1.88 |
Block Group | TOTAL | 16.00 | 16.01 | 15.42 | 15.99 |
Block | TOTAL | 4.85 | 4.85 | 4.69 | 4.85 |
U.S. | VOTINGAGE | 70.00 | 92.00 | 64.00 | 112.00 |
State | VOTINGAGE | 29.69 | 25.57 | 24.08 | 27.14 |
County | VOTINGAGE | 19.67 | 19.82 | 19.04 | 19.62 |
Tract | VOTINGAGE | 15.00 | 15.10 | 14.47 | 15.10 |
Block Group | VOTINGAGE | 21.35 | 21.40 | 20.57 | 21.34 |
Block | VOTINGAGE | 6.23 | 6.23 | 6.01 | 6.23 |
U.S. | HISPANIC | 28.00 | 24.00 | 18.00 | 14.00 |
State | HISPANIC | 24.24 | 28.00 | 27.25 | 27.02 |
County | HISPANIC | 20.29 | 20.81 | 19.48 | 20.13 |
Tract | HISPANIC | 8.11 | 8.12 | 7.81 | 8.06 |
Block Group | HISPANIC | 20.24 | 20.27 | 19.56 | 20.21 |
Block | HISPANIC | 5.82 | 5.82 | 5.63 | 5.82 |
Geographic Level | Query | baseline1 | baseline2 | inc_rho | l2_onepass |
---|---|---|---|---|---|
U.S. | CENRACE | 698.00 | 842.00 | 666.00 | 666.00 |
State | CENRACE | 393.45 | 392.55 | 384.75 | 399.76 |
County | CENRACE | 126.42 | 126.45 | 122.00 | 127.25 |
Tract | CENRACE | 35.04 | 35.02 | 33.83 | 35.04 |
Block Group | CENRACE | 41.36 | 41.39 | 40.17 | 41.39 |
Block | CENRACE | 8.04 | 8.05 | 7.80 | 8.04 |
U.S. | HISPANIC × CENRACE | 882.00 | 1,054.00 | 930.00 | 890.00 |
State | HISPANIC × CENRACE | 562.00 | 558.31 | 552.75 | 563.53 |
County | HISPANIC × CENRACE | 164.59 | 164.69 | 159.38 | 165.03 |
Tract | HISPANIC × CENRACE | 43.64 | 43.65 | 42.15 | 43.67 |
Block Group | HISPANIC × CENRACE | 49.48 | 49.50 | 48.03 | 49.45 |
Block | HISPANIC × CENRACE | 8.94 | 8.94 | 8.67 | 8.94 |
U.S. | DETAILED* | 4,456.00 | 4,332.00 | 4,176.00 | 4,212.00 |
State | DETAILED | 1,568.27 | 1,573.18 | 1,507.14 | 1,572.35 |
County | DETAILED | 294.45 | 293.71 | 284.47 | 294.50 |
Tract | DETAILED | 95.33 | 95.29 | 91.96 | 95.29 |
Block Group | DETAILED | 70.18 | 70.18 | 68.03 | 70.14 |
Block | DETAILED | 10.93 | 10.93 | 10.59 | 10.93 |
*DETAILED is shorthand for HHGQ × VOTINGAGE × HISPANIC × CENRACE.
Table 11 shows the error distribution of the TOTAL query for each of the experimental runs at the county-level grouping by the underlying CEF total population. The table shows both the mean absolute error as well as specific signed error quantiles for each of the runs and total population groupings. In general at the county level, there is a small increase in the average error as the total population increases. The error distributions are fairly symmetric and centered around zero for all but the largest of counties. In those counties there appears to be slightly larger likelihood of having a negative error; however, the errors are quite small relative to the overall population of the county. Comparing the different runs, there is not much variability in these metrics between our baseline runs or between the baseline runs and inc_rho or l2_onepass.
Considering comparable metrics at the block level (Table 12), we see that overall, the differences between the error distributions by total population are more pronounced. Blocks with small populations are on average slightly overestimated while blocks with larger populations are on average slightly underestimated by the TDA. This is directly tied to the nonnegativity query requirement and much more pronounced at the block level because there are many blocks with very small total populations. Comparing the different runs, we see that inc_rho had consistently lower errors than the others at this geographic level.
Total Population | Count | Metric | baseline1 | baseline2 | inc_rho | l2_onepass |
---|---|---|---|---|---|---|
0 to 1,000 | 35 | mean L1 | 1.314 | 0.943 | 1.314 | 1.343 |
q(0.005) | -3 | −2 | −3 | −3 | ||
q(0.025) | -3 | −2 | −3 | −3 | ||
q(0.25) | -1 | −1 | −1 | 0 | ||
q(0.5) | 0 | 0 | 1 | 0 | ||
q(0.75) | 1 | 1 | 1 | 2 | ||
q(0.975) | 3 | 3 | 3 | 4 | ||
q(0.995) | 3 | 3 | 3 | 4 | ||
1,001 to 9,999 | 663 | mean L1 | 1.611 | 1.528 | 1.596 | 1.558 |
q(0.005) | −5 | −5 | −6 | −5 | ||
q(0.025) | −4 | −4 | −4 | −4 | ||
q(0.25) | −1 | −1 | −1 | −1 | ||
q(0.5) | 0 | 0 | 0 | 0 | ||
q(0.75) | 1 | 1 | 1 | 2 | ||
q(0.975) | 4 | 5 | 4 | 4 | ||
q(0.995) | 6 | 6 | 5 | 6 | ||
10k to 99,999 | 1867 | mean L1 | 1.77 | 1.819 | 1.736 | 1.774 |
q(0.005) | −6 | −6 | −6 | −6 | ||
q(0.025) | −4 | −4 | −4 | −4 | ||
q(0.25) | −1 | −2 | −2 | −2 | ||
q(0.5) | 0 | 0 | 0 | 0 | ||
q(0.75) | 2 | 2 | 1 | 2 | ||
q(0.975) | 5 | 5 | 4 | 4 | ||
q(0.995) | 6 | 6 | 6 | 6 | ||
100k to 999,999 | 539 | mean L1 | 1.883 | 1.818 | 1.753 | 1.761 |
q(0.005) | −8 | −7 |