The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-concentrated differential privacy. Another key aspect of the TDA is the set of invariants, statistics that the Census Bureau has determined, as a matter of policy, to exclude from the privacy-loss accounting. The TDA postprocesses the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This article describes the mathematics and testing of the TDA for this purpose.
Keywords: differential privacy, 2020 Census, TopDown Algorithm, redistricting data
Differential privacy (Dwork, McSherry, et al., 2006) (henceforth DP) is considered the gold standard in privacy-protected data publication—it allows organizations to collect and publish statistics about groups of people while protecting the confidentiality of their individual responses. Initially adopted by the U.S. Census Bureau in 2008 for the OnTheMap product (Machanavajjhala et al., 2008; U.S. Census Bureau, 2008), it has since seen development by Google (Bittau et al., 2017; Erlingsson et al., 2014), Apple (Apple Differential Privacy Team, 2017), Uber (Johnson et al., 2018), and Microsoft (Ding et al., 2017). This article describes the implementation of the TopDown Algorithm, an algorithm developed within the differential privacy framework for the 2020 Census of Population and Housing in the United States (Abowd, 2018b).
The 2020 Census implementation will be among the largest deployments of differential privacy using the trusted-curator (centralized) model and, arguably, will have the highest stakes of any deployed formal privacy system, since decennial census data are used for apportionment, redistricting, allocation of funds, public policy, and research. The TopDown Algorithm (TDA) is the name given to the DP mechanisms, optimization algorithms, and associated postprocessing that are used for research and production. TDA generates confidentiality-preserving person- and housing-unit-level data called microdata detail files (MDF) with demographic and housing-unit information from the resident populations of the United States and Puerto Rico. We refer to the system that carries out the TDA as the 2020 Census Disclosure Avoidance System (DAS). The DAS is the collection of computer programs that implements statistical disclosure limitation for the census. It replaces the record-swapping system used for the 1990 through 2010 decennial censuses.
This article provides an overview of the DAS and TDA, discusses the motivation for adopting DP as the privacy-loss accounting framework as compared to other disclosure avoidance frameworks, and summarizes the critical policy considerations that motivated specific aspects of the design of TDA. We focus on the implementation of TDA that the Census Bureau used to release the 2020 Census Redistricting Data (P.L. 94-171) Summary File (redistricting data, hereafter) (Final Content Design for the Prototype 2020 Census Redistricting Data File, 2018). We discuss the utility of the TDA, but leave for future work a discussion of how users can best analyze the released data and how the privacy semantics—the interpretation of DP—are modified in the presence of invariants.
The remaining sections of the paper are organized as follows. Section 2 lays out the rationale for adopting DP as the disclosure avoidance framework for the 2020 Census. Section 3 discusses the overarching policy and practical constraints that governed the implementation. Section 4 discusses the many sources of uncertainty that are inherent in producing census population data. Section 5 provides the schema, invariants (tabulations that are not processed by the differentially private mechanisms), and constraints for the specification of the redistricting data, the first data product released using the methods described in this article. Section 6 provides an overview of the DP mechanism implemented in the TDA as well as the relevant mathematical properties. Section 7 goes into detail about estimation routines and improvements to the algorithm. Section 8 describes the tuning and testing of the DAS over time. Section 9 describes a set of experiments carried out to study the effect of certain design choices on the utility of the results using the 2010 redistricting data. Section 10 concludes.
The decennial census data are used to apportion the House of Representatives, to allocate at least $675 billion of federal funds every year,1 and to redistrict every legislative body in the nation.2 The accuracy of those data is extremely important. The Census Bureau is also tasked with protecting the confidentiality of the respondents and the data they provide. This dual mandate presents a challenging problem because data accuracy and data privacy are competing objectives (Abowd & Schmutte, 2019). For any given privacy-protection procedure, more accuracy means less privacy, and the theory underlying differential privacy and related research have helped to quantify that tradeoff. We use the term formally private to encompass frameworks that include the many variants of differential privacy. Pure differential privacy, approximate differential privacy, and concentrated differential privacy are all examples of formally private methodologies. A disclosure avoidance framework is formally private if it is constructed from randomized mechanisms whose properties do not depend on the realized confidential data nor, ideally, on limitations on the attacker’s information set. A formally private framework must provide an accounting system that quantifies the privacy loss associated with the ensemble of published query answers. TDA uses the formal privacy framework defined by zero-concentrated differential privacy, which will be defined below.
There are several reasons why the Census Bureau introduced formally private methodology for disclosure avoidance. First, the vulnerabilities of traditional disclosure limitation methods are well known among privacy researchers, and legacy systems have been attacked in various ways (Barth-Jones, 2012; Cohen & Nissim, 2020; Dinur & Nissim, 2003; Dobra & Fienberg, 2000; Dwork et al., 2017; Fienberg & Slavkovic, 2005; Garfinkel, 2015; Hansell, 2006; Homer, 2008; Kifer, 2009; Narayanan & Shmatikov, 2008; Sweeney, 2002; Wong et al., 2007). Access to petabyte-scale, cloud-based computing resources and software libraries designed to use these resources has increased enormously. At the same time, the amount of commercially available or independently held data on individuals that could be used as auxiliary information to reidentify individuals in Census Bureau products has also exploded. Unlike the Census Bureau, which has operated under a strict data confidentiality statute since 1954 (Title 13 - Census, 1954), internet-based data aggregators like Apple, Facebook, Google, LinkedIn, Microsoft, Uber, Twitter, and many others only recently became subject to tight privacy regulation in the form of the California Consumer Privacy Act (Consumer Privacy Act, 2018) and the European Union General Data Protection Regulation (General Data Protection Regulation, 2016). The presence of vast arrays of personal information held by private companies and state actors, in combination with software like Gurobi, CPLEX, and GLPK designed to solve many NP-hard systems of billions of simultaneous equations, finally realized the scenario that national statistical agencies have known was a vulnerability since Ivan Fellegi’s original paper on data confidentiality (Fellegi, 1972).
Formal methods like differential privacy were developed specifically to avoid the vulnerabilities of traditional disclosure limitation methods. Furthermore, in contrast to the previous disclosure avoidance methods used for decennial censuses, the exact methodology and parameters of the randomized mechanisms that implement differential privacy are transparent, meaning an organization can release the source code and parameters of the mechanisms, but not the actual random numbers used, without compromising the privacy guarantees (Dwork, McSherry, et al., 2006).
In addition to transparency, two other key properties make DP methods attractive compared with traditional disclosure avoidance methods. These are the absence of degradation under postprocessing and adaptive composition. First, postprocessing (Dwork, McSherry, et al., 2006; Dwork & Roth, 2014; McSherry, 2009) means that if we run an algorithm ${A}$ with no direct access to the confidential data $X$ on the output of the differentially private mechanism ${M}(X)$, then the composed algorithm ${A(M}(X))$ also satisfies differential privacy with no additional privacy loss. Second, differential privacy has an adaptive composition property. If we use a DP mechanism $M_1(X)$ to produce an output $\omega$ and a second DP mechanism $M_2(\omega, X)$, then the composed output is also differentially private and the total privacy loss of the composed mechanism is a subadditive and smoothly bounded function of the privacy loss of $M_1$ and $M_2$ (Dwork & Roth, 2014; Murtagh & Vadhan, 2016).
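These two properties can be sketched concretely. The Python fragment below is a minimal illustration, not the production mechanism: it uses a continuous Gaussian in place of the discrete Gaussian that TDA actually employs, with the standard zCDP calibration $\rho = \Delta^2/(2\sigma^2)$ for a query of sensitivity $\Delta$; the counts and $\rho$ values are hypothetical.

```python
import math
import random

def gaussian_mechanism(true_value, sensitivity, rho):
    """Continuous Gaussian mechanism calibrated so that
    rho = sensitivity**2 / (2 * sigma**2), satisfying rho-zCDP."""
    sigma = sensitivity / math.sqrt(2.0 * rho)
    return true_value + random.gauss(0.0, sigma)

# Adaptive composition: answering two queries on the same confidential
# data costs rho_1 + rho_2 in total under zCDP accounting.
rho_1, rho_2 = 0.5, 0.25
confidential_count = 1042  # hypothetical confidential tabulation
noisy_1 = gaussian_mechanism(confidential_count, 1.0, rho_1)
noisy_2 = gaussian_mechanism(confidential_count, 1.0, rho_2)
total_rho = rho_1 + rho_2

# Postprocessing: any function of the noisy outputs alone, with no
# further access to the confidential data, incurs no additional
# privacy loss.
released = max(0, round((noisy_1 + noisy_2) / 2.0))
```

Under zCDP the composed privacy loss is simply $\rho_1 + \rho_2$, and the final rounding and clamping step, because it touches only the noisy outputs, is free postprocessing.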
To the best of our knowledge, no traditional disclosure avoidance method satisfies transparency, nondegradation under postprocessing, and composition (Abowd & Schmutte, 2015). In particular, the household record-swapping method used in the 1990, 2000, and 2010 Censuses (McKenna, 2018) is not transparent and degrades poorly under composition. Its parameters and the details of the swapping algorithm cannot be published without compromising the privacy guarantee. Aggregation, which was combined with swapping in previous censuses, does not preserve uncertainty about the underlying data under postprocessing because it allows reidentification via reconstruction of the microdata (Dinur & Nissim, 2003; JASON Group, 2020). DP mechanisms also degrade under composition but in a predictable and smoothly bounded manner, thus avoiding highly accurate database reconstructions if agencies reasonably control their expended privacy-loss budget. DP mechanisms do not degrade under postprocessing. For the traditional disclosure limitation method of cell suppression (Cox, 1980; Cox et al., 1987), an arbitrarily large part of the suppression pattern can be unraveled if an attacker knows the value of a single cell that the system assumed was not known, which can occur either as prior knowledge or because some other group issued a cell-suppressed release of overlapping data but did not suppress the same cells. Cell suppression has formal guarantees, but they only hold against an attacker who knows exactly what was published and nothing else relevant. Suppression breaks down in a brittle, discontinuous way when the attacker knows more. Such knowledge may not exist when the data are published, but may enter the public domain at a later point in time. In contrast, formal privacy provides guarantees against general rather than specific classes of attackers.
While the protection of formally private systems also degrades with uncoordinated independent data releases, this degradation is smoothly bounded and fully quantified by the cumulative global privacy-loss budget.
An important aspect of formal privacy is the acknowledgment that the attacker’s knowledge plays a large role in disclosure avoidance and that assumptions about this knowledge are inherently imperfect. Thus, the definitions used in formal privacy systems attempt to limit the extent to which the protections depend upon the specifics of the attacker’s prior information. These limitations usually take the form of worstcase analysis and in that sense can be viewed as a minimax approach to a class of privacy protection problems (Wasserman & Zhou, 2010).
Rather than attempting to make statements about the absolute privacy risk, which requires specifying the attacker’s information set and strategy, DP shifts the focus to relative privacy guarantees, which hold for arbitrary attacker prior beliefs and strategies. For example, rather than saying Alice’s data are not reidentifiable given the output of the DP mechanism, we say that Alice’s incremental risk of reidentification is measured by the difference between the output of the DP mechanism that did not include Alice’s data as input versus the same one that did. An attacker or user should be able to make a statistical inference about Alice—one that depends on aggregate data—but not an identifying inference—one that is enabled only by the presence of Alice’s information as input to the DP mechanism. The mathematical analysis that quantifies these limitations on identifying inferences is called privacy semantics (Balle et al., 2020; Dong et al., 2022; Kairouz et al., 2015; Kasiviswanathan & Smith, 2014; Wasserman & Zhou, 2010).
There are multiple definitions of formal privacy, each having slightly different motivation and mathematical primitives. We use two specific formalizations as part of the TDA. The primary definition is that of approximate differential privacy or $(\epsilon,\delta)$-differential privacy. The parameter $\epsilon$ is commonly referred to as ‘the privacy-loss parameter’ for this definition. The parameter $\delta$ represents the amount of departure from pure differential privacy and is typically set at values that are smaller than $1/N$, where $N$ is the number of records in the data set, because a mechanism that releases a record at random satisfies $(0, 1/N)$-differential privacy. The second definition is called zero-concentrated differential privacy (zCDP), which implies $(\epsilon,\delta)$-differential privacy and has very robust privacy-loss accounting tools and composition properties. Concentrated differential privacy was introduced by Dwork and Rothblum (2016) to provide tighter analysis of privacy loss due to composition. We implement the variant defined by Bun and Steinke (2016a, 2016b), which can be used to provide a complete privacy-loss accounting framework using the discrete Gaussian distribution for noise addition (Canonne et al., 2020, 2021).
The reason for specifying two privacy definitions is that, during the course of implementing the TDA, we observed substantial accuracy improvements from adopting a discrete Gaussian mechanism (Canonne et al., 2021) for the generation of noisy measurements versus the geometric mechanism (Ghosh et al., 2012). While we quantify our privacy semantics using approximate differential privacy, privacy-loss accounting using the discrete Gaussian mechanism more naturally aligns with zCDP—allocating $\rho$ rather than $\epsilon$. Therefore, we adopted the zCDP privacy-loss accounting for TDA. To permit comparison to other DP implementations and because reasoning about the complete distribution of the privacy-loss random variable rather than its tail area is more difficult to communicate to policymakers, we translate the zCDP privacy parameter $\rho$ into the corresponding $(\epsilon, \delta)$ values for approximate differential privacy. When converting $\rho$ to $(\epsilon, \delta)$, we intentionally did not use the tightest conversion (Canonne et al., 2020, 2021), but instead chose to use a conversion (Bun & Steinke, 2016b) that overestimates $\delta$ and is interpretable as an upper bound on the probability that a particular level of $\epsilon$ does not hold (Asoodeh et al., 2020).
Rather than determining a single $(\epsilon, \delta)$ pair, zCDP produces a continuum of $(\epsilon, \delta)$ pairs corresponding to a particular value of $\rho$, and all satisfying Definition 1. The privacy-loss accounting is done using $\rho$. See Figure 1 for examples of the $(\epsilon, \delta)$ curve based on zCDP values of $\rho=1$ and $\rho=2.63$ (the final production setting of TDA for the redistricting data). When $\delta = 0$, Definition 1 is that of pure differential privacy (Dwork et al., 2006). The curve in Figure 1 is based on the privacy-loss random variable and indicates that zCDP provides privacy protection that weakens as $\delta$ approaches 0, but never results in catastrophic failure, which some approximate differential privacy mechanisms do permit.4 No single point on Figure 1 summarizes the value of $\rho$ completely. In public presentations of the work, the Census Bureau has provided the $(\epsilon, \delta)$ pairs associated with $\delta=10^{-10}$ to facilitate the transition from $(\epsilon, \delta)$ privacy-loss accounting to $\rho$-based privacy-loss accounting. However, a privacy analysis of only this $(\epsilon, \delta)$ pair would lead to a very incomplete analysis of the privacy protections provided by a given setting of $\rho$. We do not attempt to further interpret the $\delta$ parameter in this work.
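The general shape of such a curve can be reproduced with the simple conversion from Bun and Steinke (2016b), under which $\rho$-zCDP implies $(\rho + 2\sqrt{\rho\ln(1/\delta)},\ \delta)$-differential privacy for every $\delta \in (0,1)$. The sketch below is illustrative only (it is not necessarily the exact conversion used to produce Figure 1); it traces a few points of the curve for the production setting $\rho = 2.63$.

```python
import math

def zcdp_to_approx_dp_eps(rho, delta):
    """Simple, non-tight conversion: rho-zCDP implies
    (rho + 2*sqrt(rho*ln(1/delta)), delta)-differential privacy."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

rho = 2.63  # production setting for the redistricting data
# A few points on the (epsilon, delta) curve; epsilon grows as delta shrinks.
curve = {d: zcdp_to_approx_dp_eps(rho, d) for d in (1e-5, 1e-10, 1e-15)}
```

As the text notes, no single $(\epsilon, \delta)$ pair on this curve summarizes $\rho$ completely; the whole curve is implied by the single accounting parameter.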
Definition 1 describes a property of the output distribution of DP mechanism ${M}$. It characterizes a guarantee that similar input databases will produce approximately the same output distribution. That is, an adversary cannot learn much more about the specific data of an arbitrary individual in the database from the mechanism’s output than if the record for that individual had been removed and replaced with an arbitrary default set of values (Kasiviswanathan & Smith, 2014). The translation to confidentiality protection comes from the argument that, if a person’s record is first changed to a default set of values, then their confidentiality is protected because their data were never used in any published statistic. The degree to which the DP mechanism’s output distribution is close to this private counterfactual depends on the privacy-loss parameter $\rho$ for zCDP or $\epsilon$ for $(\epsilon,\delta)$-DP. The parameter $\delta$ obtained from the conversion from $\rho$-zCDP is the amount by which the pure DP guarantees are violated. For zCDP, $\rho$ provides a continuum of $(\epsilon, \delta)$ pairs with these interpretations. DP mechanisms that implement zCDP can be tuned to balance the data accuracy and privacy loss of a particular application. In the case of the 2020 Census and other government statistical applications, setting these values is (or should be) the job of policymakers. TDA was designed to permit policymakers, in particular the Census Bureau’s Data Stewardship Executive Policy Committee (DSEP), to set the global privacy-loss parameter and its allocation to each query. The typical goal is to achieve the maximum accuracy for a given privacy-loss parameter ($\rho$ or $\epsilon$). However, DSEP’s instructions generally tried to minimize the privacy-loss parameter given a lower bound on utility (see Section 6).
The data curator is the organization that collects and retains the data to be protected, often in a formal database. The 2020 Census application of DP uses the trusted-curator model, sometimes called the centralized DP model, meaning that a legally authorized entity (the U.S. Census Bureau) collects and stores the confidential data and is responsible for the release of tabulations (queries) based on those data, applying appropriate confidentiality protection techniques on the tabulations. This is in comparison with the untrusted-curator model, or local DP model, in which confidentiality protection techniques are applied to individual data records as they are collected and thus the curator never observes the confidential data directly, only through the output of the local DP mechanism. We point out the differences between these two models because the DP methods associated with them are necessarily different. An early example of a privacy-protection method in the untrusted-curator model is randomized response (Warner, 1965) in which the respondent answers a question truthfully with some predetermined probability, but otherwise supplies a fixed answer (or answers a different question). The advantage of the trusted-curator model is that it allows for DP methods that are more statistically efficient, meaning that the results will be more accurate given the same privacy-loss parameter.
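A minimal sketch of randomized response in the local model may help fix ideas. In this common variant (the truth probability and the 30% prevalence below are illustrative, not drawn from any census application), each respondent reports their true bit with probability $p$ and a uniformly random bit otherwise, and the curator debiases the observed ‘yes’ rate.

```python
import random

def randomized_response(truth, p_truth=0.75):
    """Local-model report: the true bit with probability p_truth,
    otherwise a uniformly random bit."""
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_proportion(responses, p_truth=0.75):
    """Debias the observed 'yes' rate: E[obs] = p*pi + (1 - p)/2."""
    obs = sum(responses) / len(responses)
    return (obs - (1.0 - p_truth) / 2.0) / p_truth

# Hypothetical population in which 30% hold the sensitive attribute.
random.seed(0)
population = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(x) for x in population]
estimate = estimate_proportion(reports)
```

The curator never sees a respondent's true bit, only the randomized report, which is the sense in which the local model requires no trusted curator; the price is the much larger variance of the debiased estimator relative to central-model noise addition.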
In designing disclosure avoidance methods for the 2020 Census, several practical and policy considerations played a large role. Decennial census products are generally tables with counts of individuals, households, group quarters residents, or housing units with certain characteristics. The tabulation systems in place for the 2020 Census assume that the input is an individual or housing-unit record-level data set in which each row represents a person or a particular housing unit. This type of data set is often called microdata. In order to align with the tabulation system, the design specifications for TDA mandated the production of microdata—a record-level image for each person (or housing unit) that, when tabulated, produces the protected query answers (not their confidential counterparts).
Producing privacy-protected microdata is not a common requirement for disclosure avoidance systems implemented using the centralized DP model, which are often designed to produce an unbiased noisy measurement of the answer to the original confidential query according to the use case for the queries. An important implication of the microdata requirement is that it forces the final output of TDA to be nonnegative and integral because there is no accommodation for a negative weight or fractional person in the Census Bureau’s microdata tabulation systems.5 From a practical point of view, this makes complete sense. Some might find it strange if the Census Bureau reported that there were $1.3$ persons with characteristic $A$ in geography $B$. Any disclosure avoidance system that allows for changing counts that are zero to nonzero values and enforces nonnegativity, including suppression and swapping, suffers from systematic bias (Abowd & Schmutte, 2015).
There are two ways that the microdata requirement could be met by the disclosure avoidance system. The differentially private mechanism could directly output data consistent with the microdata requirement. Alternatively, the output of the differentially private mechanism could be postprocessed to conform to the microdata requirement. The first approach could theoretically be achieved by using the exponential mechanism (McSherry & Talwar, 2007); however, that would require the mechanism to use the fully specified output range schema because the exponential mechanism relies on sampling from that range. For our application, that approach was infeasible because it required enumerating the discrete elements of the output space that simultaneously met the microdata, invariant, edit constraint, and structural zero requirements. Not only would this have required a new theoretical apparatus to solve, but during the course of the development of the DAS the requirements changed several times. Instead, the DAS uses postprocessing to impose these requirements hierarchically using a preexisting framework from the operations research literature, specifically total unimodularity (Nemhauser & Wolsey, 1999, Sections III.1 and III.2).
There are some noticeable downstream consequences of imposing nonnegativity. If the confidential count of characteristic $A$ in geography $B$ is $0$ and the TDA output of that count must be nonnegative, then the microdata-based estimator, constrained to use nonnegative weights, will be positively biased. This applies to both the postprocessing case and the exponential mechanism case above, but note that when using the discrete Gaussian mechanism, the noisy measurements themselves are unbiased. In addition, the tabulation from microdata makes the errors in the disclosure avoidance system data-dependent. Controlling the algorithmic biases and the data-dependent errors—properties due entirely to postprocessing and not the DP mechanisms—is the hardest implementation challenge for TDA (Abowd, Ashmead, Cumings-Menon, Garfinkel, et al., 2021). Inferential methods using a posterior sampling framework and the noisy measurements directly rather than the postprocessed tabulations have shown promising results, though there are computational limitations given the scope of the 2020 Census (Seeman et al., 2020).
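The positive bias from nonnegativity can be illustrated with a small simulation. The sketch below uses a continuous Gaussian as a stand-in for the discrete Gaussian mechanism, with an illustrative noise scale: the raw noisy measurement of a true count of zero is unbiased, but clamping it to be nonnegative shifts its mean up to roughly $\sigma/\sqrt{2\pi}$.

```python
import math
import random

random.seed(1)
sigma = 10.0       # illustrative noise scale, not a production value
true_count = 0     # confidential count of characteristic A in geography B
n = 200_000

# Unbiased noisy measurements of a zero count (continuous Gaussian
# used here as an approximation of the discrete Gaussian).
noisy = [true_count + random.gauss(0.0, sigma) for _ in range(n)]
# Clamping to satisfy the nonnegativity requirement of microdata.
clamped = [max(0.0, x) for x in noisy]

mean_noisy = sum(noisy) / n        # approx. 0: the raw measurement is unbiased
mean_clamped = sum(clamped) / n    # approx. sigma/sqrt(2*pi): positive bias
theoretical_bias = sigma / math.sqrt(2.0 * math.pi)
```

This is the simplest instance of the phenomenon described above: the bias is introduced entirely by the postprocessing step, not by the DP mechanism itself.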
A second, but related, requirement for the 2020 Census implementation was internal consistency, which is a natural property of tabulating all statistics from the same record-level input data with constant, nonnegative weights (all unity for the 2020 Census). Internal consistency means that a count of interest will be exactly the same if it appears in multiple tables or can be calculated from taking the sum or difference of counts from other tables. Consistency is a consequence of the algorithm producing privacy-protected microdata, but it is possible to have consistency without requiring tabulation from microdata.
A third requirement for the TopDown Algorithm was the implementation of invariants, edit constraints, and structural zeros. By invariants we mean counts that deliberately bypass the DP mechanism and are reported with certainty exactly as they are tabulated from the confidential data. The most notable invariants for the 2020 Census are the state total resident populations because of their essential function in the apportionment of the House of Representatives. Edit constraints and structural zeros are rules determined by Census Bureau subject matter experts, then applied to the confidential data as part of the processing that produces the Census Edited File, one of the two inputs to TDA. An example of an edit constraint is the requirement that a mother be at least some number of years older than her oldest natural child.
The privacy loss associated with the invariants is not quantifiable in the standard DP literature, and therefore the inclusion of invariants in the DAS complicates the interpretation of the privacy guarantees. Instead of going into detail here, we save this topic for a separate paper and note that the privacy-loss accounting for the DAS presented in this article ignores any contributions from the invariants.
One of the most important and challenging design considerations for the 2020 Census DAS is taking into account the multidimensional nature of the utility of the census data. While apportionment is perhaps the most important use case, census data are used for hundreds if not thousands of public, private, and commercial applications ranging from redistricting, to population age-distribution estimation, to characteristics of the housing stock. Therefore, the TDA must produce output that satisfies the same edit constraints, where feasible, with controllable accuracy, in the sense of data utility, for all of its use cases. From an economic point of view, algorithmic design and implementation for TDA can be seen as allocating a scarce resource (the finite information content of the confidential data) between competing use cases (different accuracy measures) and respondent privacy loss (Abowd & Schmutte, 2019).
Two excellent examples of competing use cases are redistricting and intercensal population estimates. Redistricting consists of drawing contiguous geographic entities that are mutually exclusive, exhaustive, and divide a political entity into voting districts of approximately equal populations. These districts also must satisfy the antidiscrimination provisions in the 1965 Voting Rights Act (42 U.S. Code § 1973). To satisfy these requirements, nonpartisan redistricting experts work with the Census Redistricting and Voting Rights Data Office (part of the Census Bureau) to designate the smallest unit of geography—the atom from which the districts are assembled—and the taxonomy for race and ethnicity categories, which must comply with Office of Management and Budget Statistical Policy Directive 15 (Office of Management and Budget, 1997). Population estimates require annually updated populations for approximately 85,000 politically and statistically defined geographic areas and age by sex by race by ethnicity tables for all counties in the United States and municipios in Puerto Rico. Accuracy for redistricting targets statistics on population in 252 categories (voting age by race by ethnicity) for geographic areas that are not yet defined when the disclosure avoidance is applied. Accuracy for population estimates targets one statistic (total population) in predefined geographies ranging in size from a few hundred to millions of people and about 2,000 statistics (sex by age in single years by major race and ethnicity categories) in counties and municipios. Accuracy for redistricting is improved primarily by allocating privacy loss to the lowest levels of geography (block groups and blocks in census terminology) and the most detailed race and ethnicity summaries. Accuracy for population estimates is improved by allocating privacy loss to more aggregated geographies (counties and tracts) and favoring detail on age over detail on race and ethnicity.
The use of DP mechanisms for the 2020 Census DAS will not be the first time uncertainty is deliberately introduced into the tabular summaries from a decennial census to enhance disclosure avoidance. And in any census, disclosure avoidance uncertainty is not the only source of uncertainty affecting the published statistics. Historically, the decennial census has used data suppression, blank and replace, swapping, and partial synthetic data for disclosure avoidance in the tabular summaries (McKenna, 2018). In addition, sampling and coarsening along with swapping and partially synthetic data were used for public-use microdata samples (McKenna, 2019). These methods also create uncertainty in inferences using the published data. Suppression creates a nonignorable missing data problem because the suppression rule depends on the size of the statistic being suppressed and on the correlation of the complementary suppressions with primary suppressions. Swapping and partially synthetic data affect inferences because of the randomness used by both procedures (Abowd & Schmutte, 2015). However, the Census Bureau, like most national statistical agencies, does not provide any quantification of the uncertainty due to suppression, swapping, or partial synthesis. Consequently, the effect on inferences from the published data is usually not acknowledged because researchers are unable to quantify it. For example, in the case of traditional swapping, the parameters of the swapping algorithm are secret. In the case of disclosure avoidance systems implementing DP mechanisms, the parameters and distributions of the mechanism’s randomness are public and therefore can be quantified. TDA is an inference-valid disclosure avoidance system. Its uncertainty can be quantified, and that quantification can be used to guide inferences from the published data, though doing so can often be nontrivial.
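As a simple illustration of this quantifiability: because the noise distribution of a DP mechanism is public, a user can attach a margin of error to any single noisy measurement. The sketch below uses a normal approximation with a hypothetical published value and noise scale (the production noise is discrete Gaussian, for which this is only an approximation).

```python
# With a public noise scale sigma, the disclosure-avoidance
# contribution to uncertainty for one noisy measurement can be
# bounded with a standard normal-approximation interval.
def measurement_interval(noisy_value, sigma, z=1.96):
    """Approximate 95% interval for the underlying value given one
    noisy measurement with known, publicly documented noise scale."""
    half_width = z * sigma
    return (noisy_value - half_width, noisy_value + half_width)

# Hypothetical published count and noise scale.
lo, hi = measurement_interval(noisy_value=512.0, sigma=7.0)
```

No analogous calculation is possible for swapping or suppression, where the perturbation parameters are secret; this is the practical meaning of an inference-valid system.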
The census-taking process itself can be viewed as a random event. Modern incomplete data inference has formalized the many reasons why a realized census does not produce perfect, complete data. Post-enumeration surveys have shown the undercoverage of persons with some characteristics and the overcoverage of persons with other characteristics (Hogan et al., 2013). Additionally, observed data items can be missing or erroneous for some individuals. Given the observed data, missing data techniques are used to produce the Census Edited File (CEF)—the final version of the confidential census data. Statistically, the CEF is the completed census data from a collection of single-imputation missing data models.6 In the CEF, there is a record for every housing unit, occupied group quarters resident, and housing unit resident person alive on April 1, 2020, with all tabulation variables containing valid responses on every record. There is an enormous investment in the technology of specifying this ancillary estimator for the completed census data, given the observed census data, even when there is no adjustment for differential net undercoverage. Finally, disclosure avoidance techniques are applied to these completed census data to produce tabular summaries that protect confidentiality. In this article, we use the CEF as the standard for estimating the accuracy of the DAS output. For better understanding the uncertainty introduced by disclosure avoidance, we compare the DAS output to small-scale simulations that quantify the uncertainty from census operational, measurement, and coverage errors for the 2010 Census (Bell & Schafer, 2021).
The Redistricting Data (P.L. 94-171) Summary Files are an essential data product, as they are the basis for state redistricting, official population counts, and other statutory uses. The tables are therefore sometimes called the ‘redistricting data’ tables. Before going into additional details, it is important to have a sense of the geographic hierarchy that the Census Bureau uses in the creation of data products.7 The primitive element of the tabulation geographic hierarchy is the census block. All other geographic entities (states, tribal census areas, counties, places, etc.) can be defined as aggregations of blocks. The ‘spine’ of the tabulation geographic hierarchy consists of the United States, states (including the District of Columbia), counties, tracts, block groups, and blocks.8 In each case, the entities at lower levels of geography on the spine aggregate without partitions to entities above them. That is, on the tabulation geographic spine, a block belongs in one, and only one, block group; a block group belongs in one, and only one, tract; and so on up the spine. This tabulation geographic spine plays an important role in the TopDown Algorithm. Other geographic aggregations (e.g., congressional districts, ZIP Code tabulation areas, incorporated places, etc.) are considered ‘off-spine’ because they do not fit directly onto the tabulation spine’s hierarchy.
The redistricting data tables for 2020 (U.S. Census Bureau, 2020a) are
P1: Race
P2: Hispanic or Latino, and not Hispanic or Latino by race
P3: Race for the population 18 years and over
P4: Hispanic or Latino, and not Hispanic or Latino by race for the population 18 years and over
P5: Group Quarters (GQ) population by major GQ type
H1: Occupancy status.
Each table is produced down to the census block level. In order to tabulate from microdata, it is necessary to consider only a subset of census attributes with mutually exclusive levels. We designate these as the detailed queries for the person and housing-unit data, respectively. For person data, the following attributes and levels constitute the detailed query to create the redistricting data tables:
Block. 5,892,698 levels;9
Household or Group Quarters Type. 8 levels: household, correctional facilities for adults, juvenile facilities, nursing facilities/skilled-nursing facilities, other institutional facilities, college/university student housing, military quarters, other noninstitutional facilities;
Voting Age. 2 levels: age 17 or younger on April 1, 2020, age 18 or older on April 1, 2020;
Hispanic/Latino origin. 2 levels: Hispanic/Latino, not Hispanic/Latino;
Census race. 63 levels: every combination of Black/African American, American Indian/Alaska Native, Asian, Native Hawaiian/Pacific Islander, White, and some other race, except “none of the above.”
Excluding block, there are 2,016 mutually exclusive, exhaustive combinations of these attribute levels. Including block, there are 11,879,679,168 levels; the number of persons in each must be output by the DP mechanisms so that TDA can be used to tabulate each of the redistricting data person-level tables.
For housing-unit data, the following attributes and levels constitute the detailed query to create the redistricting data tables:
Block. 5,892,698 levels;
Occupancy status. 2 levels: occupied, vacant.
Therefore, a total of 11,785,396 mutually exclusive, exhaustive combinations of the attribute levels must be output by the DP mechanisms so that TDA can be used to tabulate each of the redistricting data housing-unit tables.
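The cardinalities above are straightforward products of the attribute levels; a quick sketch (all figures taken from the text) confirms the arithmetic:

```python
# Verify the detailed-query cardinalities quoted in the text.
BLOCKS = 5_892_698          # tabulation blocks
HHGQ = 8                    # household or group-quarters type levels
VOTING_AGE = 2              # 17 or younger / 18 or older
HISPANIC = 2                # Hispanic / not Hispanic
RACE = 2**6 - 1             # 63 nonempty combinations of six race categories

person_cells_no_geo = HHGQ * VOTING_AGE * HISPANIC * RACE
assert person_cells_no_geo == 2_016

person_cells = BLOCKS * person_cells_no_geo
assert person_cells == 11_879_679_168

housing_cells = BLOCKS * 2  # occupied / vacant
assert housing_cells == 11_785_396
```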
In addition to the measurements associated with the specified tables, the TDA is programmed to bypass the DP mechanisms when instructed by policymakers at the Census Bureau. Queries that bypass DP mechanisms are called invariants. Mathematically, an invariant is equivalent to designating infinite privacy loss for the specified tabulation, putting such queries outside the formal guarantees of differential privacy. In addition, invariants challenge the implementation of TDA because they may force the postprocessing phase of TDA to attempt to solve computationally infeasible problems (see Section 7). As a consequence, the Census Bureau policymakers have specified invariants only when there was either a constitutional mandate or an operational constraint in the conduct of the census.
The constitutional constraint comes from the mandate to apportion the House of Representatives based on the “actual Enumeration” of the population on census day. Technically, this constraint only requires the allocation of the seats in the House to be invariant—not subject to changes due to the confidentiality protection system; however, DSEP (the Census Bureau’s Data Stewardship Executive Policy Committee) determined that setting the populations of the 50 states, the District of Columbia, and the Commonwealth of Puerto Rico invariant was the most transparent and practical way to accomplish this objective. The policymakers accepted the risk associated with releasing these data outside the enhanced protections provided by differential privacy; that is, aggregation is the only disclosure avoidance applied to the state populations.
The operational constraints come from the methods used to keep the Census Bureau’s Master Address File (MAF)—the frame for the decennial census and many other surveys—up to date. The MAF is a database of known addresses and features or characteristics of these addresses. The most important feature is living quarters, which means that the address designates a domicile capable of housing at least one human being. Fixed domiciles come in two types: housing units, which usually domicile a single household, and group quarters, which usually domicile a group of unrelated individuals.10 There are also living quarters that are not fixed domiciles, such as service-based facilities (soup kitchens, shelters, etc.) and transient locations (tents, campgrounds, etc.). In addition, there are addresses for facilities that are not deemed living quarters at all, such as businesses, highway medians, and fire hydrants.
Maintaining the MAF is a continuous process involving many operations. A few are salient to the disclosure avoidance system. First, by statute, the Census Bureau conducts the Local Update of Census Addresses (LUCA) late in the census decade. The LUCA operation enlists partners at all levels of government—states, recognized tribes, counties, and municipalities—to review the addresses in the MAF. In support of this operation, the Census Bureau publicly distributes certain summaries of the MAF; among these public tabulations are the number of addresses in each block that represent housing units and the number that represent group quarters of different types. While the addresses associated with these housing units and group quarters remain confidential, the counts are used to guide local partners to areas they believe may be incomplete. In these areas, the local partners supply candidate addresses, which are used to update the MAF. Second, throughout the decade, the Census Bureau cooperates with local expert demographers to update the location and types of group quarters facilities in an area. Since group quarters are in scope for the American Community Survey, this cooperation also affects that survey. Critically, the local demographers use public information on facilities like prisons, nursing homes, dormitories, and barracks to facilitate timely updates of the MAF group quarters data. For operational reasons, then, the Census Bureau treats the count of physical addresses on a block, their designation as housing units or group quarters, and the type of group quarters facility as public information. Note that while unoccupied (vacant) housing units are enumerated in the census, vacant group quarters are not. Hence, TDA treats housing units and occupied group quarters as invariant. The Census Bureau policymakers interpret this decision as requiring the disclosure avoidance to be consistent with public information.
This decision means that a prison, or other group quarters, will be properly labeled in census tabulations if it is properly labeled in the confidential data, but the number and characteristics of the population living in those facilities or in housing units is protected by the DP mechanisms.
The complete list of invariants implemented in TDA is:
total population for each state, the District of Columbia, and Puerto Rico;
the count of housing units in each block, but not the population living in these housing units;
the count and type of occupied group quarters in each block, but not the population living in these group quarters.
Edit constraints and structural zeros occur when a particular combination of values of the characteristics in the confidential data is deemed impossible a priori; that is, before any data are collected. For example, if the relationship of person two to person one (also called the householder) is “natural child,” then person two cannot have age greater than person one’s age. Edit constraints and structural zeros modify the sample space of the final output data. In the case of the DAS, this would be the postprocessed results (MDF) but not the intermediate noisy measurements.
Edit constraints are conceptually different from invariants; however, when examining the full constraint set in TDA, they are mathematically indistinguishable. Nevertheless, it is important to maintain the distinction. Edit constraints and structural zeros are imposed on the confidential data during the process of creating the Census Edited File. They are never violated in the input data to TDA by design; therefore, they must also be respected in the output data. Invariants, on the other hand, are statistics computed from the input data that deliberately skip the differential privacy mechanism. They are passed to the postprocessing without noise injection. Hence, the distinction is that edit constraints and structural zeros define the sample space for the census. They are features of its design, not part of the information collected from respondents. Edit constraints and structural zeros are public information known before data collection, while invariants are statistics collected in the census data that bypass the confidentiality protection system by policy.11
The complete edit specification for the 2020 Census Edited File is not a public document, nor is the complete edit specification for the 2010 Census Edited File. The version of TDA that processed the redistricting data product contains only a few edit constraints and structural zeros. The most important of these are listed here:
for the race variable, the combination “none of the above” is a structural zero;
for occupancy status, there are no unoccupied group quarters.
The race constraint is derived from OMB Statistical Policy Directive 15 (Office of Management and Budget, 1997), which requires the use of the categories listed in the redistricting data schema except for “some other race,” which was congressionally mandated (Consolidated Appropriations Act, 2010). Because a respondent cannot select “none of the above,” there are $2^6 - 1 = 63$ allowable combinations of the six race categories, not $2^6$. There are unoccupied group quarters on enumeration day, but they are removed from the MAF universe used in the Census Edited File.
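The count of allowable race combinations can be checked by direct enumeration; a short sketch (category labels abbreviated for illustration):

```python
from itertools import combinations

RACES = ["White", "Black or African American", "AIAN", "Asian", "NHPI",
         "Some Other Race"]

# "None of the above" (the empty selection) is a structural zero, so the
# allowable race levels are exactly the nonempty subsets of six categories.
levels = [c for r in range(1, len(RACES) + 1)
          for c in combinations(RACES, r)]
assert len(levels) == 2**6 - 1 == 63
```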
The structural zero on unoccupied group quarters, the occupied group quarters invariant, and the housing unit invariant interact in creating the TDA constraint set. If there is an occupied group quarters facility of a particular type in the relevant geographic area, then at least one person must be assigned in postprocessing to that facility. If there is a housing unit in the particular geography, however, there is no similar requirement because occupancy status is not invariant for housing units. However, the number of householders (person one on the questionnaire) cannot be greater than the number of housing units. Aside from this, TDA treats the data for persons and housing units separately. Hence, occupancy status is protected without reference to the person data. This separation is deliberate. Joining the housing-unit data to the person data and then applying a differentially private mechanism to the joined data is a much harder problem because the sensitivity of queries (see Definition 3) about the composition of a household, including how many persons live in the housing unit, is the maximum allowable household size, not one. Join queries like this will be processed by DP mechanisms implemented in algorithms that do not postprocess to microdata. These data products will be released at a later date as part of the Detailed Demographic and Housing Characteristics (Detailed DHC) data release. Users may be tempted to divide the total household population in a particular geography from table P1 by the total number of occupied housing units in that geography from table H1. For well-populated geographies this answer will be useful; for sparsely populated geographies, however, it will not be reliable. The join-query DP mechanisms implemented in the Detailed DHC are designed to produce reliable data for average household size.
The TDA and the block-by-block algorithm, which we describe in Appendix A, rely on DP mechanisms responsible for generating the noisy measurements. For the TDA, we use the discrete Gaussian mechanism, based on the discrete Gaussian distribution (Definition 4). The multivariate discrete Gaussian mechanism is described in Theorem 1; setting $a=1$ yields the univariate case (Canonne et al., 2020, 2021). Let $\mathbb{Z}$ represent the space of integers and $\mathcal{X}^{n}$ a data set of $n$ records, where $\mathcal{X}$ is the sample space or data universe. Theorem 1 uses the concept of query sensitivity provided in Definition 3. Query sensitivity is important because it affects the scale of the noise distribution for a fixed privacy-loss value.
The discrete Gaussian mechanism was not always the DP mechanism in the TDA. In earlier work we used the geometric mechanism, which, like the discrete Gaussian mechanism, produces noise with integer values but uses the double-geometric rather than the discrete Gaussian distribution. The discrete Gaussian mechanism was ultimately chosen based on empirical tests of the accuracy of the TDA using each mechanism with comparable privacy-loss budgets. The discrete Gaussian distribution has smaller tail probabilities than the double-geometric distribution, and this provably reduces the worst-case errors associated with the microdata requirement (Abowd, Ashmead, Cumings-Menon, Garfinkel, et al., 2021).
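For intuition, a deliberately simple (and inefficient) way to sample from the discrete Gaussian distribution is to normalize $\exp(-k^2/(2\sigma^2))$ over a truncated integer support. The sketch below is illustrative only; exact samplers that avoid truncation are given by Canonne et al. (2020, 2021):

```python
import math
import random

random.seed(7)

def discrete_gaussian(sigma: float, tail: int = 20) -> int:
    """Approximate draw from N_Z(0, sigma^2) over a truncated support.

    Truncating at roughly +/- tail*sigma is a numerical shortcut for
    illustration; the exact samplers in the literature avoid truncation.
    """
    support = list(range(-int(tail * sigma) - 1, int(tail * sigma) + 2))
    weights = [math.exp(-k * k / (2.0 * sigma * sigma)) for k in support]
    return random.choices(support, weights=weights, k=1)[0]

draws = [discrete_gaussian(3.0) for _ in range(10_000)]
```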
The census-taking process results in a person-level data set of $n$ records with $k$ characteristics (e.g., age, race, ethnicity), including geography. Each characteristic may take exactly one of a finite number of values, which define the sample space, $\mathbf{\chi}$. An equivalent definition exists for the sample space of housing units. The sample space for persons is defined by enumerating all combinations of the values of every person characteristic measured in the census. Thus, $c=\text{Cardinality}(\mathbf{\chi})$ is the number of rows in $\mathbf{\chi}$ and represents the total possible number of combinations of the characteristics of the persons. For the 2020 Census Redistricting Data (P.L. 94-171) Summary File, $c = 11,879,679,168$ for the person-level data and $c = 11,785,396$ for the housing-unit–level data. Of course, these calculations show the dimensionality of these sample spaces for a single person or unit. There are combinatorially more feasible databases satisfying these schemas that are also consistent with all invariants.
Geographic levels play an important role in the methodology. It is useful to consider the cardinality of the sample space disregarding geography. Let the sample space without geography be represented by $\mathbf{\chi^*}$ and its cardinality by $c^*=\text{Cardinality}(\mathbf{\chi^*})$. For the redistricting data case, $c^* = 2,016$ for the person-level data and $c^* = 2$ for the housing-unit–level data.
Instead of representing the data as an $n \times k$ person-level matrix (also called a table in the database literature), another way to represent the data is as a fully saturated contingency table (also called a histogram in the privacy-preserving data publication literature) where every cell corresponds to a possible record type as defined by the schema and its value is the number of records of that type. Rather than using a multidimensional array, such as a contingency table, it is notationally convenient to consider instead the flattened table, a length-$c$ vector flattened in any predefined ordering that spans the schema. Let $\mathbf{x}$ represent the contingency table vector.
A set of linear queries on $\mathbf{x}$ is represented by an $(a \times c)$ matrix $\mathbf{Q}$, and the vector of query answers of length $a$ is obtained by matrix multiplication of the query matrix and the contingency table vector: $\mathbf{Q} \mathbf{x}$. Let $\mathbf{\widetilde{M}}$ represent a differentially private random vector of answers to a set of linear queries. Measurements have the form $\mathbf{\widetilde{M}} = \mathbf{Q} \mathbf{x} + \mathbf{y}$, where $\mathbf{y}$ is a vector of independent random variables. Specifically, in the context of the TDA, we assume each component $y_j \sim \mathcal{N}_{\mathbb{Z}}(0, \sigma^2_j)$, $j=1, \cdots, a$, a discrete Gaussian distribution with a potentially distinct variance $\sigma^2_j$ for each of the $a$ noisy measurements. In the TDA, to turn the set of $\sigma^2_j$ values into a $\rho$ value, we use Theorem 1. See Appendix B for additional details.
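The measurement step is a single multiply-and-add; a minimal NumPy sketch with a stand-in noise generator (rounding a continuous Gaussian, which is not the exact discrete Gaussian mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy histogram x over a 4-cell schema (counts of each record type).
x = np.array([5, 0, 2, 7])

# Q stacks two query groups: the total query and a 2-level marginal.
Q = np.array([
    [1, 1, 1, 1],   # total
    [1, 1, 0, 0],   # marginal, level 1
    [0, 0, 1, 1],   # marginal, level 2
])

# Stand-in for discrete Gaussian noise: a rounded continuous Gaussian.
# (Illustration only; this is NOT the exact discrete Gaussian mechanism.)
sigma = np.array([2.0, 3.0, 3.0])         # per-query noise scales
y = np.round(rng.normal(0.0, sigma)).astype(int)

M = Q @ x + y                             # the noisy measurements
assert M.shape == (3,)
```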
A key feature of census data and tabulation products is geography. We use the symbol $\Gamma$ to represent a geographic hierarchy (United States, state, county, tract, block group, block for the United States; Puerto Rico, municipio, tract, block group, block for Puerto Rico): a directed, rooted tree with $\gamma_0$ as its root. Given a node $\gamma \in \Gamma$, we let $\mathrm{parent}(\gamma)$ represent its parent geographic node in the hierarchy and $\mathrm{children}(\gamma)$ represent the set of geographic nodes that are its children. We let $\mathrm{leaves}(\gamma)$ represent all of the leaves under node $\gamma$ and $\mathrm{leaves}(\Gamma)$ represent all of the leaves of the tree. In particular, $\mathrm{leaves}(\gamma_0)=\mathrm{leaves}(\Gamma)$. We refer to the set of all nodes at a fixed distance from $\gamma_0$ as a geographic level, which in the case of the census may be one of the United States, state, county, tract, block group, or block.12
Let $\mathbf{x}_{\gamma}$ be the contingency table vector for a specific geographic node $\gamma$. By definition, all cells of the vector not associated with the given geography are 0; therefore, it is convenient to view $\mathbf{x}_{\gamma}$ as a length-$c^*$ vector, disregarding the geographic dimension of the contingency table. Similarly, let $\mathbf{\widetilde{M}}_{\gamma} = \mathbf{Q} \mathbf{x}_{\gamma} + \mathbf{y}$ be a geography-specific version of the differentially private random vector of answers to a set of linear queries. For the purposes of the TDA, it is necessary to group all the children of a node together to jointly estimate the solution. Let $\mathbf{x}_{\mathrm{children}(\gamma)}$ be the contingency table vector obtained by stacking $\mathbf{x}_{\mathrm{child}}$ for every $\mathrm{child} \in \mathrm{children}(\gamma)$, making it a vector of length $c^* \times |\mathrm{children}(\gamma)|$, where $|\mathrm{children}(\gamma)|$ is the number of children of node $\gamma$. Let $\mathbf{\widetilde{M}}_{\mathrm{children}(\gamma)}$ represent the similarly stacked version of the differentially private measurements.
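The tree operations above can be sketched with a hypothetical miniature of the spine; the `GeoNode` class and its method names are illustrative, not the DAS implementation:

```python
# Hypothetical miniature of the tabulation spine: each node knows its
# children; parent, children, and leaves() mirror the notation in the text.
from dataclasses import dataclass, field

@dataclass
class GeoNode:
    name: str
    children: list["GeoNode"] = field(default_factory=list)
    parent: "GeoNode | None" = None

    def add(self, child: "GeoNode") -> "GeoNode":
        child.parent = self
        self.children.append(child)
        return child

    def leaves(self) -> list["GeoNode"]:
        # A node with no children is itself a leaf (a block).
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

us = GeoNode("US")                       # the root, gamma_0
state = us.add(GeoNode("State A"))
county = state.add(GeoNode("County 1"))
blocks = [county.add(GeoNode(f"Block {i}")) for i in range(3)]

assert [l.name for l in us.leaves()] == ["Block 0", "Block 1", "Block 2"]
assert blocks[0].parent is county
```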
The TopDown Algorithm processes the noisy measurements from the DP mechanism and the invariants into an estimate of the contingency table vector. We use $\hat{\bf{x}}$ or equivalently $\hat{\bf{x}}_{\mathop{\mathrm{\gamma}}}$, $\forall \mathop{\mathrm{\gamma}}\in \mathop{\mathrm{\Gamma}}$, to denote the output of TDA.
Depending on the specifics of the invariants and edit constraints, both equality and inequality constraints may be imposed within TDA. Specifically, in the redistricting data use case, inequality constraints are due to the invariants imposed on the counts of housing units and occupied group quarters, because these put bounds on the number of people that must be in a geographic unit. For example, given the invariant on the number of occupied group quarters facilities of a certain type by block, there must be at least one person per occupied group quarters type in that block because vacant group quarters facilities are out of scope for the census. We denote the constraints on the algorithm output resulting from the invariants as a set of equality and inequality constraints:

$$\mathbf{C}^{eq} \hat{\mathbf{x}} = \mathbf{c}^{eq}, \qquad \mathbf{C}^{u} \hat{\mathbf{x}} \leq \mathbf{c}^{u},$$
where $\mathbf{c}^{eq}$ and $\mathbf{c}^{u}$ are calculated from $\mathbf{x}$ and possibly additional housing-unit data. For example, total state populations are invariant; therefore, the state population for each state is an element of $\mathbf{c}^{eq}$, and the rows of $\mathbf{C}^{eq}$ force the elements of $\hat{\mathbf{x}}$ that correspond to components of the state population to sum to the invariant total. Another example is group quarters populations in any geographic unit. Group quarters must be occupied; therefore, if $k$ occupied group quarters facilities of a given type are present in a particular geographic unit, this generates a lower-bound constraint of $k$ that is an element of $\mathbf{c}^{u}$, and the rows of $\mathbf{C}^{u}$ force the elements of $\hat{\mathbf{x}}$ that correspond to components of the group quarters population of that type to sum to at least $k$.13
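A toy example of how an invariant and a lower-bound constraint can be encoded; the sign convention that expresses an "at least $k$" requirement in the $\mathbf{C}^{u} \hat{\mathbf{x}} \leq \mathbf{c}^{u}$ form is our illustration, not a quotation of the production constraint set:

```python
import numpy as np

# Toy 4-cell histogram for one state; suppose the state total (14) is
# invariant and cell 3 is an occupied group-quarters cell that must hold
# at least one person.
x_hat_candidate = np.array([5, 0, 2, 7])

# Equality constraint: all four cells sum to the invariant state total.
C_eq = np.array([[1, 1, 1, 1]])
c_eq = np.array([14])

# Inequality in the C^u x_hat <= c^u form used in the text:
# "cell 3 holds at least 1 person" becomes -x_hat[3] <= -1.
C_u = np.array([[0, 0, 0, -1]])
c_u = np.array([-1])

assert (C_eq @ x_hat_candidate == c_eq).all()
assert (C_u @ x_hat_candidate <= c_u).all()
```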
The census ultimately produces a set of tabulations that can be characterized as a query matrix $\mathbf{A}$ multiplied by the contingency table vector. Without any disclosure avoidance applied, the tabulations would be calculated as $\mathbf{A} \mathbf{x}$. While $\mathbf{A} \mathbf{x}$ is not the true value as this term is defined in classical statistics—the unknown population quantity that is the estimand for a census or survey—because of the uncertainty described in Section 4, it is treated as the reference value for purposes of disclosure avoidance because it is the result of answering the query directly from the confidential data. Therefore, the goal of the DAS is to implement a sequence of algorithms that gives optimal tabular estimates $\mathbf{A} \mathbf{\hat{x}}$ relative to $\mathbf{A} \mathbf{x}$ according to some metric for a given privacy-loss budget.
The algorithms developed for the redistricting data were first and foremost designed to meet stringent fitness-for-use accuracy targets for the redistricting and Voting Rights Act use cases. This is a challenging task because the geographic units of interest are the voting districts that will be drawn after the data are published. Hence, the target geographies cannot be specified in advance. Fortunately, these voting districts divide political entities like states, counties, incorporated places, school districts, and tribal areas whose geographies are prespecified. Therefore, error metrics that optimize accuracy within these predefined political entities help properly control the metrics for their equal-population future voting districts. The collections of blocks in the voting districts are not arbitrary. They cover the political entity and form aggregates of approximately equal total population. When the political entity has a low population, there are usually only a few voting districts or none at all. While a complete description of the criteria for forming legislative districts is beyond the scope of this article, more details about the compactness, contiguity, political boundary, and community of interest requirements are provided by the nonpartisan organization that represents states’ interests in the development of redistricting data, the National Conference of State Legislatures (2021).
We performed explicit experimental parameter tuning using 2010 Census data to ensure that the largest racial/ethnic group (as defined in Statistical Policy Directive 15 [Office of Management and Budget, 1997]) in off-spine geographies (e.g., minor civil divisions, census places, American Indian Reservations, etc.) with populations of at least $500$ people, expressed as a percentage of the total population, was within $\pm 5$ percentage points of the enumerated percentage at least $95\%$ of the time.14 We also examined the same metric expressed as a percentage of the voting-age population and found that controlling the ratio to the total population also controlled the error in the percentage of the largest race/ethnicity group in the voting-age population. This utility goal was developed in consultation with stakeholders and is analyzed extensively in Wright and Irimata (2021).
Independently, we also evaluated the performance of TDA by examining the absolute and relative error of many statistics by quantiles of the count in the 2010 CEF. These error metrics were used for total population, voting-age population, number of races, population in each OMB race category, and population in each ethnicity category within OMB race category. These quantile statistics were used primarily to determine the interactions of changes in the privacy-loss budget allocation among the queries, which cannot be done ex ante because, while the errors from the DP mechanism are independent of the data, the errors in postprocessing are not.
Our accuracy assessments included many metrics (U.S. Census Bureau, 2020b, 2021c) using the 2010 Census data that were developed by demographers at the Census Bureau or suggested by external users. They included many different error measures such as mean absolute error, mean absolute percent error, percent difference thresholds, outliers, as well as different characteristics of interest relevant to redistricting (total populations, total population 18 years and over, occupancy rates, Hispanic or Latino origin population, etc.) across many different geographic levels.
The existing policy and scientific literature provide very little guidance on managing a privacy-loss budget to trade off accuracy on multiple dimensions. When tuning the TDA, our experiments were designed to find the smallest privacy-loss budget that met specified accuracy goals. In particular, we used a tight standard for redistricting data accuracy in the smallest voting districts that the Department of Justice Voting Section provided as examples of Section 2 scrutiny under that law. Finding the minimum privacy-loss budget that could achieve this goal illustrated where other accuracy objectives deteriorate—in particular, statistics at the tract and county level, which were downweighted by the redistricting accuracy measure. Other subject matter experts, internal and external, then laid out their accuracy goals for these other statistics. Internal statistical disclosure limitation experts, including some of the authors of this article, demonstrated that these accuracy goals—redistricting and general demographic—could be achieved with a modest privacy-loss budget at the block level. The iterative experimental tuning process is described in more detail in Section 8.
The statistical optimization problem can be summarized as: given accuracy targets based on $\mathbf{A} \mathbf{\hat{x}}$ relative to $\mathbf{A} \mathbf{x}$, choose query matrices $\mathbf{Q}^* = \left\{ {\mathbf{Q}}_1, {\mathbf{Q}}_2, \cdots, {\mathbf{Q}}_m \right\}$ with corresponding privacy-loss budgets $\rho^* = \left\{ \rho_1, \rho_2, \cdots, \rho_m \right\}$ and an estimator $\hat{\mathbf{x}}$ computed from the resulting noisy measurements
that meets the accuracy targets such that the total privacy-loss budget $\sum_{i=1}^{m} \rho_i$ is minimized, where $\hat{\mathbf{x}} \in \mathbb{Z}^{0+}_d$, meaning the estimated vector is of length $d$ and has nonnegative integer elements, and $\mathbf{C}^{eq} \hat{\mathbf{x}} = \mathbf{c}^{eq}$ and $\mathbf{C}^{u} \hat{\mathbf{x}} \leq \mathbf{c}^{u}$, meaning the solution satisfies the constraints. Here we denote ${\mathbf{M}}^*= \left\{\mathbf{\widetilde{\mathbf{M}}}_1, \mathbf{\widetilde{\mathbf{M}}}_2, \cdots, \mathbf{\widetilde{\mathbf{M}}}_m \right\}$ as the differentially private vector of noisy measurements corresponding to $\mathbf{Q}^*$ and $\rho^*$. Note the three major elements of the algorithm design:
choosing which query matrices to use for differentially private noisy measurements;
choosing the accuracy of each of the noisy measurements; and
optimally combining the noisy measurement information to produce a nonnegative integer estimate of $\mathbf{x}$ that satisfies all constraints.
While these algorithm elements should work hand in hand, we note that there could be many variations of element (3), using the same measurements produced by elements (1) and (2), that yield results with good utility. Not only are there many objective functions that can be used for optimization, but there are also many variations of methods to produce nonnegative integers satisfying a set of constraints. Over the course of this work, we experimented with each of these elements in order to build the best algorithm for the key redistricting and demographic use cases.
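As a sketch of element (3), one crude variation is ordinary least squares followed by clipping and rounding; TDA itself solves a constrained convex program followed by a separate integer-rounding optimization, so this only conveys the two-step shape:

```python
import numpy as np

# Crude stand-in for element (3): fuse noisy measurements by least
# squares, then clip to nonnegative values and round to integers.
rng = np.random.default_rng(1)

x_true = np.array([5, 0, 2, 7])
Q = np.array([
    [1, 1, 1, 1],   # total query
    [1, 1, 0, 0],   # 2-level marginal
    [0, 0, 1, 1],
    [1, 0, 0, 0],   # detailed query
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])
M = Q @ x_true + rng.integers(-2, 3, size=Q.shape[0])  # toy integer noise

x_ls, *_ = np.linalg.lstsq(Q, M, rcond=None)   # real-valued solution
x_hat = np.round(np.clip(x_ls, 0, None)).astype(int)

assert (x_hat >= 0).all()
```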
While it would be possible to ask a linear query $\mathbf{Q}$ with generic integer elements, we restrict ourselves to queries that consist of only binary elements. Furthermore, while mathematically it is often convenient to write a single query matrix for the entire problem or for a single geography, in practice we think of sets of queries stacked together into a single query matrix and define them in terms of dimensions of the schema. In order to explain the specifics of the queries used in TDA, as discussed extensively in Section 8, we describe here the structure of the complete query matrix, which is partitioned by level of the geographic hierarchy and by marginal query group within each geographic level. Privacy-loss budget ($\rho$) allocations are made to the geographic level and then to the marginal query group within each level. When these partitions are stacked vertically within geographic level and concatenated horizontally across the geographic hierarchy, they generate the full linear query matrix.
We use the term marginal query group to refer to a matrix with rows given by linear queries such that the product with the data vector, $\mathbf{x}$, provides the estimates for a given marginal of the data.15 For example, in the redistricting data, we might ask the census race marginal query group and the voting-age marginal query group for a given geography. By the census race query group, we mean the 63 total population counts in each of the OMB-designated race categories for the given geography. By the voting-age query group, we mean the two total population counts for ages 17 or younger and 18 or older. We could also ask the query group census race by voting age, which would be the 126 counts of each of the levels of OMB-designated race crossed with voting age. In addition to individual or crossed dimensions of the schema, we also consider the cross of all dimensions of the schema as well as the collapse of all dimensions, which we give special names: the detailed query and the total query, respectively. Lastly, we also consider Cartesian products of mutually exclusive collapses of a schema attribute as a query. For example, we could collapse the 63-level census race attribute into a three-level attribute with levels: one race selected, two races selected, and three or more races selected. A query strategy may also include Cartesian products of collapsed categories.
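Marginal query groups over a flattened contingency table can be written as Kronecker products: an identity matrix on the queried attribute and a summation (all-ones) row over each collapsed attribute. A toy sketch with a hypothetical 3-level race attribute crossed with voting age, which also exhibits the $\sqrt{2}$ sensitivity bound:

```python
import numpy as np

# Toy schema: a hypothetical 3-level race attribute crossed with the
# 2-level voting-age attribute, flattened race-major into a length-6 vector.
x = np.array([4, 1, 0, 2, 3, 3])

# Marginal query groups as Kronecker products: identity on the queried
# attribute, all-ones summation over the collapsed attribute.
race_marginal = np.kron(np.eye(3), np.ones((1, 2)))  # 3 queries
age_marginal = np.kron(np.ones((1, 3)), np.eye(2))   # 2 queries
total = np.ones((1, 6))                              # the total query
detailed = np.eye(6)                                 # the detailed query

assert (race_marginal @ x == np.array([5, 2, 6])).all()
assert (age_marginal @ x == np.array([7, 6])).all()

# A neighboring database changes one record (one cell -1, another +1),
# so a marginal query group's answer vector moves by at most sqrt(2)
# in L2 norm: its sensitivity.
x_neighbor = x.copy()
x_neighbor[0] -= 1   # the changed record leaves race level 0 ...
x_neighbor[2] += 1   # ... and joins race level 1 (same voting age)
delta = race_marginal @ (x_neighbor - x)
assert np.isclose(np.linalg.norm(delta), np.sqrt(2))
```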
The individual marginal query groups we consider have two important properties: they are exhaustive in terms of the levels (we do not query 18 and older without also querying 17 and younger) and the levels are mutually exclusive. Along with the use of binary elements, these properties ensure that the $L_2$ sensitivity of a single marginal query group is at most $\sqrt{2}$, extending the $L_p$ sensitivity of a single query defined in Definition 3 to a group of queries. For a query group, this means that for any two possible neighboring databases, where neighboring is defined as the result of changing the value of a single record in the database (which maintains the total number of records), the $L_2$ norm of the change in the group's answers is at most $\sqrt{2}$. Similarly to the sensitivity of a single query (Definition 3), the sensitivity of a query group is important because it affects the scale of the noise needed for a fixed privacy-loss or $\rho$ value: the larger the sensitivity, the larger the scale of the noise that is needed in the algorithm. By construction, all queries in a query group receive the same precision (a function of the $\rho$ allocation and the sensitivity) for their DP mechanism noisy measurements, though the precision given to each query group may differ both within and between geographies.
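To make the $\sqrt{2}$ bound concrete, the following sketch checks it directly on a toy voting-age schema (the schema and helper names are illustrative assumptions, not part of TDA): changing a single record moves one level's count down by one and another level's count up by one, so exactly two coordinates of the query group answer change by 1 each.

```python
import math

# Toy schema: the voting-age attribute with two mutually exclusive,
# exhaustive levels: 0 = "17 or younger", 1 = "18 or older".
# A record is one person; the marginal query group counts people per level.

def query_group_answer(records):
    """Answer the marginal query group: one count per level (each row of
    the implied query matrix is binary, selecting records at one level)."""
    counts = [0, 0]
    for age_level in records:
        counts[age_level] += 1
    return counts

# Neighboring databases: change the value of a single record
# (the total number of records is maintained).
db1 = [0, 0, 1, 1, 1]
db2 = [0, 1, 1, 1, 1]  # first person's age level changed from 0 to 1

a1 = query_group_answer(db1)
a2 = query_group_answer(db2)

# Exactly two coordinates change, each by 1, so the L2 distance is sqrt(2).
l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))
assert l2 == math.sqrt(2)
```

Because the levels are exhaustive and mutually exclusive, no neighboring pair can change more than two coordinates, so $\sqrt{2}$ is the worst case for the whole group.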
The motivation for the TDA is to leverage the tree-like hierarchical structure of census geographies in order to produce, in a scalable manner, estimates of the contingency table vector that are more precise than those of algorithms focused on directly and independently estimating block-level vectors and then aggregating.16 The TDA can be thought of as having two phases, a measurement phase and an estimation phase.17 Table 1 presents a schematic summary of the algorithm.
[Table 1 schematic summary of TDA. Measurement phase: for each level $i \in$ {US, state, county, tract, block group, block}, take differentially private measurements of the key queries. Estimation phase: beginning with the US root node $\gamma_{0}$, estimate the contingency table vector ${\mathbf{x}}_{\gamma_{0}}$, then proceed down the hierarchy.]
First, in the measurement phase, sets of differentially private noisy measurements of key queries are taken at each of the geographic levels in the hierarchy. Because queries at different geographic levels (say, block and tract) will involve the same individuals, the total privacy loss will be the summation of the privacy loss at the individual geographic levels. Therefore, the distribution of the total privacy-loss budget across levels of the geography, as well as across the queries within a geographic level, must be chosen with care. In the estimation phase, the goal is to produce a nonnegative integer estimate of the contingency table vector from the differentially private noisy measurements.
As the name implies, estimation is carried out sequentially, starting with the root geography (United States). The algorithm estimates the nonnegative integer U.S.-specific contingency table vector from the set of U.S.-level differentially private queries on the data, maintaining any equality or inequality constraints implied by the invariants or due to edit constraints and structural zeros. Due to the complexity of the problems, this is done in a two-step manner. First, we estimate a constrained nonnegative weighted least squares solution from the U.S.-level differentially private queries that respects all constraints implied by the invariants. Second, we perform a controlled rounding that finds a nearby nonnegative solution that also respects the constraints. See Section 7.1 for more detail. Once the solution is found, the U.S.-level estimated contingency table vector is considered fixed and used as an additional constraint on the next subproblem in the sequence.
The algorithm then estimates the state-level contingency table vectors, enforcing an aggregation constraint to the U.S.-level contingency table vector and using the state-level differentially private measurements on the data. This is done using a similar two-step process: first, solving a weighted nonnegative least squares problem; then, solving a controlled rounding problem. However, in addition to any constraints that are the product of invariants or edit specifications, an aggregation constraint is added to both subproblems to ensure that the estimated state-level contingency table vectors sum to the U.S.-level estimated contingency table vector.
Fixing the estimated state-level contingency table vectors, the estimation steps are repeated within each state in order to estimate each of the county-level contingency table vectors, enforcing aggregation constraints for each state. The algorithm then proceeds iteratively to additional geographic levels (tract, block group, and block) until the contingency table vector is estimated for each block. In each step, the differentially private measurements taken at that level are used along with constraints due to invariants, edit specifications, and aggregation.
In an ideal scenario, the contingency table vector $\mathbf{x}$ for all blocks could be estimated in its entirety simultaneously, minimizing error relative to all differentially private measurements at once and respecting invariant constraints; however, the scale of such a problem in the decennial census is several orders of magnitude too large given current computing limitations. Therefore, one way to view TDA is that it breaks the problem down into smaller, more computationally tractable subproblems. Even after doing so, the size of the problems remains quite large. The desired output of the algorithm is a nonnegative integer vector $\hat{\mathbf{x}}$, and therefore an integer programming approach would be the natural fit. Unfortunately, the size and complexity of the subproblems are still too large to solve directly as an integer program. Therefore, we further break down each optimization step into two subproblems: 1) a constrained weighted least squares optimization that finds an estimated nonnegative continuous-valued contingency table vector; and 2) a controlled rounding problem that solves a simpler integer program than in the direct method and yields a nonnegative integer contingency table vector. For simplicity, we refer to these as the least squares optimization problem and the rounder optimization problem, respectively.
In this section, we present the estimators for the least squares and rounder optimization problems that we initially developed. Then, we provide the details of enhancements to these basic estimators: more complex estimators that find a solution by performing a series of optimizations and constraining passes. We detail these more complex estimators in Sections 7.2–7.3.
The basic least squares estimator. There are two versions of the least squares estimation problem depending upon whether the subproblem includes a single node—as in Step (1)(a) of the estimation phase of TDA—or multiple nodes—as in Step (2)(a) of the estimation phase of TDA. Let $\mathbf{W}$ be the diagonal matrix with the inverse variances of the differentially private random variables along its diagonal. Then, the least squares estimator for a single node $\gamma$ is
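In the notation above, a plausible form of this estimator is the following (a reconstruction consistent with the surrounding definitions, not a verbatim restatement of the published equation (7.1); the constraint matrices $\mathbf{C}^{eq}$, $\mathbf{c}^{eq}$, $\mathbf{C}^{u}$, and $\mathbf{c}^{u}$ encode the invariant and edit constraints discussed in the text):

```latex
\tilde{\mathbf{x}}_{\gamma} \;=\;
\operatorname*{arg\,min}_{\mathbf{x}\,\ge\,\mathbf{0}}\;
\left(\mathbf{Q}\mathbf{x}-\widetilde{\mathbf{M}}_{\gamma}\right)^{T}
\mathbf{W}
\left(\mathbf{Q}\mathbf{x}-\widetilde{\mathbf{M}}_{\gamma}\right)
\quad\text{subject to}\quad
\mathbf{C}^{eq}\mathbf{x}=\mathbf{c}^{eq},\;\;
\mathbf{C}^{u}\mathbf{x}\le\mathbf{c}^{u},
```

where $\widetilde{\mathbf{M}}_{\gamma}$ denotes the vector of DP noisy measurements taken at node $\gamma$.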
Now, let $\mathbf{W}$ be the diagonal matrix with the inverse variance of the differentially private random variables for each of the children of node $\gamma$ along its diagonal. While choices of $\mathbf{W}$ other than inverse variance were explored (e.g., constant, square root of variance), both theoretically (according to standard weighted least squares theory) and in practice, the inverse variance provided the best performance overall. Let $\mathbf{I}_{c^*}$ represent an identity matrix of size $c^*$. Then the least squares estimator for the joint solution for the children of $\gamma$ is
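A sketch of the joint estimator, consistent with the surrounding notation, is the following, where $\mathbf{x}$ stacks the contingency table vectors (each of length $n$) of the $c^*$ children of $\gamma$, $\widetilde{\mathbf{M}}_{\text{children}(\gamma)}$ stacks the children's noisy measurements, and the Kronecker-product form is an assumption of this reconstruction:

```latex
\tilde{\mathbf{x}}_{\text{children}(\gamma)} \;=\;
\operatorname*{arg\,min}_{\mathbf{x}\,\ge\,\mathbf{0}}\;
\left(\left(\mathbf{I}_{c^{*}}\otimes\mathbf{Q}\right)\mathbf{x}
      -\widetilde{\mathbf{M}}_{\text{children}(\gamma)}\right)^{T}
\mathbf{W}
\left(\left(\mathbf{I}_{c^{*}}\otimes\mathbf{Q}\right)\mathbf{x}
      -\widetilde{\mathbf{M}}_{\text{children}(\gamma)}\right)
```

subject to $\mathbf{C}^{eq}\mathbf{x}=\mathbf{c}^{eq}$, $\mathbf{C}^{u}\mathbf{x}\le\mathbf{c}^{u}$, and an aggregation constraint of the form $\left(\mathbf{1}_{c^{*}}^{T}\otimes\mathbf{I}_{n}\right)\mathbf{x}=\tilde{\mathbf{x}}_{\gamma}$, which forces the children's vectors to sum to the fixed parent solution.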
In (7.1) and (7.2) we use generic notation for $\mathbf{Q}, \mathbf{C}^{eq}, \mathbf{c}^{eq}, \mathbf{C}^{u}$, and $\mathbf{c}^{u}$ rather than node-specific subscripts in an effort to simplify notation, though it should be noted that the values of $\mathbf{Q}, \mathbf{C}^{eq}, \mathbf{c}^{eq}, \mathbf{C}^{u}$, and $\mathbf{c}^{u}$ will be different depending on the specific child or parent node.
The basic rounder estimator. Similar to the least squares optimization problem, we specify two versions of the solution depending on whether we are estimating over a single or multiple nodes. Let $\lfloor{\mathbf{x}}\rfloor$ be the floor function (applied elementwise to a vector). Then the rounder estimator for a single node $\gamma$ is
Then the rounder estimator for the joint solution for the children of $\gamma$ is
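Writing $\tilde{\mathbf{x}}_{\text{children}(\gamma)}$ for the least squares solution, a sketch of the joint rounder in floor-subtracted form (a reconstruction consistent with the surrounding text, not the verbatim published equation) is:

```latex
\hat{\mathbf{x}}_{\text{children}(\gamma)} \;=\;
\left\lfloor \tilde{\mathbf{x}}_{\text{children}(\gamma)} \right\rfloor
+ \operatorname*{arg\,min}_{\mathbf{y}\,\in\,\{0,1\}^{n c^{*}}}\;
\left\| \mathbf{y} - \left( \tilde{\mathbf{x}}_{\text{children}(\gamma)}
 - \left\lfloor \tilde{\mathbf{x}}_{\text{children}(\gamma)} \right\rfloor \right) \right\|_{1}
```

subject to the same equality, inequality, and aggregation constraints as the least squares problem, applied to $\lfloor\tilde{\mathbf{x}}_{\text{children}(\gamma)}\rfloor+\mathbf{y}$.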
DAS implements the rounder problem by subtracting the floor and minimizing the distance over the fractional part of the target vector. This problem is feasible and efficiently solvable if it has the total unimodularity (TUM) property (see Section 7.2).
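The floor-subtraction device can be illustrated with a toy version that enforces only a single invariant-total constraint (the production rounder solves an integer program with many more constraints; the function name here is hypothetical):

```python
import math

def controlled_round(x, target_sum):
    """Round each entry of x up or down to an adjacent integer so the
    result sums exactly to target_sum. Operates on the fractional parts
    after subtracting the floor, as described above."""
    floors = [math.floor(v) for v in x]
    fracs = [v - f for v, f in zip(x, floors)]
    # Mass the rounder must distribute: the target minus the floored total.
    k = target_sum - sum(floors)
    # Round up the k entries with the largest fractional parts.
    order = sorted(range(len(x)), key=lambda i: fracs[i], reverse=True)
    up = set(order[:k])
    return [f + (1 if i in up else 0) for i, f in enumerate(floors)]

x = [1.7, 0.2, 2.6, 0.5]          # continuous least squares solution
rounded = controlled_round(x, 5)  # the invariant total is 5
assert rounded == [2, 0, 3, 0]
assert sum(rounded) == 5
```

Each output entry equals its floor or floor plus one, so no individual cell moves by more than one, while the invariant total is preserved exactly.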
While we found that the basic least squares estimator worked well overall, we observed a concerning trend in the bias of the estimation of geographic totals. Even when these queries were asked with very high precision, combining them in an optimization with a query with a large number of components (e.g., the detailed query) produced an estimated data vector that was less accurate on the target query than would be expected. We know that this problem is due to the constraint that all cell counts must be nonnegative. We now understand that there is an inherent uncertainty principle limiting the accuracy of inequality-constrained least squares. The maximum expected error in either the detailed query or its total depends upon both the privacy-loss budget allocation and the total number of queries (Abowd, Ashmead, Cumings-Menon, Garfinkel, et al., 2021).18 Managing this accuracy trade-off between detailed and total queries, which is due entirely to the postprocessing requirement to produce privacy-protected microdata, motivated the algorithmic improvements discussed here.
To improve on the single-pass nonlinear least squares estimator, we developed a multipass estimator. Instead of attempting to find a data vector solution that minimizes errors for all noisy queries at once, we use multiple passes of the optimizer to estimate increasingly more detailed tabulations of the data vector. On each pass, after estimating one or more margins of the contingency table vector, we constrain the following passes to find a solution that maintains the previously estimated margins. For example, for the redistricting data persons schema, we might first estimate and constrain the total population counts in a first pass, then estimate the detailed query in a second pass subject to the constraint that its margin equals the total population count from the first pass. Multipass processing, therefore, controls the algorithmic error due to the nonnegativity constraints by prioritizing the queries based on the expected size of the total query (total population before voting-age population, etc.).
The input of the least squares multipass methods described here is a collection of marginal queries and, for each such query group, the vector of DP random measurements for the query group. Let $k\in \{1, \dots, K\}$ represent the passes such that $\tilde{\mathbf{x}}_{\text{children}(\gamma)}^{(k)}$ is the estimated data vector after pass $k$, $\mathbf{Q}^{(k)}$ is the stacked set of query groups used for pass $k$, and $\widetilde{\mathbf{M}}_{\text{children}(\gamma)}^{(k)}$ holds the corresponding DP random measurements. Let $\mathbf{B}^{(k)}$ be the matrix formed by stacking query group matrices used for pass $k$; it will be used for constraints in passes greater than $k$. For brevity, we omit the single-node version of the estimator and focus on the joint solution for the children of $\gamma$, which is given by $\tilde{\mathbf{x}}_{\text{children}(\gamma)}^{(K)}$, where the optimization routine is given by (7.5).
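A sketch of the pass-$k$ problem consistent with these definitions is the following (a reconstruction; the tolerance form of the carried-forward constraints is an assumption based on the role of $\hat{\tau}^{(j)}$ described next, and $\mathbf{W}^{(k)}$ holds the inverse variances of the pass-$k$ measurements):

```latex
\tilde{\mathbf{x}}^{(k)}_{\text{children}(\gamma)} \;=\;
\operatorname*{arg\,min}_{\mathbf{x}\,\ge\,\mathbf{0}}\;
\left(\mathbf{Q}^{(k)}\mathbf{x}-\widetilde{\mathbf{M}}^{(k)}_{\text{children}(\gamma)}\right)^{T}
\mathbf{W}^{(k)}
\left(\mathbf{Q}^{(k)}\mathbf{x}-\widetilde{\mathbf{M}}^{(k)}_{\text{children}(\gamma)}\right)
```

subject to the invariant, edit, and aggregation constraints of the basic joint estimator and, for each earlier pass $j<k$, the elementwise constraint $\left|\mathbf{B}^{(j)}\mathbf{x}-\mathbf{B}^{(j)}\tilde{\mathbf{x}}^{(j)}_{\text{children}(\gamma)}\right|\le\hat{\tau}^{(j)}$, which maintains the previously estimated margins.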
The parameter $\hat{\tau}^{(j)}$ represents a small positive constant to help ensure feasibility of future solutions. Rather than specifying this number, we solve the following secondary optimization problem to estimate it at the end of each pass $k$:
Note that the basic least squares estimator can be recovered by choosing $K=1$ and including all query groups for which we have DP random measurements. We have found it generally advantageous to first estimate query groups with elements that are unlikely to be zero using an estimator with negligible bias. Constraining the estimates in subsequent passes to have marginals that are consistent with these earlier query group estimates has the additional effect of reducing the positive bias on cell counts with true values that are zero. Our current estimation strategy limits the positive bias of query group estimates in early passes by omitting from the objective function query groups that are likely to have many true answers that are zero; for example, the first pass may include only query groups that are higher-order marginals (e.g., total, voting-age). However, by excluding DP random measurements, we are not including all available information. Note that the choice of query groups does not depend on the confidential data.
The basic rounder estimator finds a nearby integer solution by minimizing the distance from the real-valued solution with respect to the individual cell values while still respecting constraints. We found that, as the size of the optimization problems grew, even though individual cells in the rounder solution were forced to be close to their real-valued counterparts, the summation of many cells (e.g., a variable margin) could still change by a large amount. Changes to the individual detailed cell values could add up to make a large change in, for example, the voting-age population. To improve this behavior, we developed an estimation routine that optimizes based on the distance from the real-valued solutions to the queries used in the least squares estimator. Like the least squares multipass estimator, we also allowed for multiple passes of the rounder at increasing levels of detail.
Because the nonnegative least squares estimator is solved numerically, the amount of mass (accumulated fractional parts of the least squares solution) the rounder was responsible for estimating increased as privacy-loss budgets increased. That is, as we increased the privacy-loss budget, we wanted to ensure monotone convergence to perfect accuracy. However, when using the original, basic single-pass rounder, as the privacy-loss budget increased many more cell estimates from the least squares problem were 0.999 (not 1.0). Because the rounder is responsible for estimating a total amount of mass that is equal to the sum of the fractional parts of the least squares estimator, the rounder became increasingly important to solution quality the larger the privacy-loss budget. Rounding using only the detailed query answers is clearly an inferior estimator of most queries of interest compared to using all queries from the least squares solution. Hence, basic rounding caused very troubling behavior: specifically, an intermediate region in which increased privacy-loss budget resulted in poorer performance on utility metrics. When we implemented multipass rounding, we also added the ability to specify additional queries as part of the rounder’s objective function. TUM implies restrictions on these extra queries. Without violating TUM, we inserted a single additional set of hierarchically related queries (a ‘tree of queries’) in this way. This modification of the objective function solved the nonmonotonicity problem.
Let $k\in \{1, \dots, K\}$ represent the passes such that $\hat{\mathbf{x}}_{\text{children}(\gamma)}^{(k)}$ is the estimated data vector after pass $k$,19 and $\mathbf{Q}^{(k)}$ is the stacked set of query groups used for pass $k$ (which can be different from the query groups used in the multipass least squares estimator). Meanwhile, $\tilde{\mathbf{x}}_{\text{children}(\gamma)}$ is the vector produced by the multipass least squares estimator. Let $\mathbf{B}^{(k)}$ be the matrix for the stacked query groups that will be used as constraints in passes greater than $k$. For brevity, we omit the single-node version of the estimator and focus on the joint solution for the children of $\gamma$. The solution is denoted by $\hat{\mathbf{x}}_{\text{children}(\gamma)}^{(K)}$, the terminal value in the following optimization letting $k=1, \dots, K$:
We note that the existence of a solution for the multipass rounder is not always guaranteed and depends heavily on the choice of the query groups used in each pass and their ordering. In particular, to guarantee polynomial-time solution of the rounder estimator, the mathematical constraints must exhibit TUM, the condition that every square nonsingular submatrix is unimodular (integer entries and determinant $+1$ or $-1$). TUM does two things. First, if a continuous solution exists to the nonnegative least squares problem, then TUM implies that an integer solution exists to the rounder problem. Second, TUM implies that finding the rounder solution is tractable. TUM does not, unfortunately, guarantee that the continuous nonnegative least squares problem has a solution in the first place. Because the mathematical constraints in the rounder estimator are composed of interactions between the invariants, edit constraints, and hierarchical consistency constraints, ensuring TUM requires limiting invariants and edit constraints as much as possible.
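The TUM condition can be checked by brute force on small matrices, which makes the definition concrete (this enumeration is exponential and purely illustrative; production systems rely on structural results such as the consecutive-ones property rather than enumeration):

```python
from itertools import combinations

def det_int(M):
    """Integer determinant via Laplace expansion (fine for tiny matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] *
               det_int([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def is_totally_unimodular(A):
    """Brute-force TUM check: every square submatrix must have
    determinant in {-1, 0, +1}."""
    m, n = len(A), len(A[0])
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                sub = [[A[r][c] for c in cols] for r in rows]
                if det_int(sub) not in (-1, 0, 1):
                    return False
    return True

# An interval matrix (consecutive ones in each row) is TUM:
assert is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 1, 1]])
# The odd-cycle incidence matrix below has determinant 2, so it is not:
assert not is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
```

The failing example shows how easily an extra constraint can break TUM, which is why invariants and edit constraints must be limited.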
American Indian and Alaska Native (AIAN) areas tend to be far off the standard tabulation spine and, therefore, are at risk of poor accuracy for a given privacy-loss budget. For example, the largest AIAN tribal area is the Navajo Nation, which includes subsets of many counties within the states of New Mexico, Arizona, and Utah. To improve the accuracy of the AIAN areas, we define what we call the AIAN spine, which subdivides each state with any AIAN tribal areas into two geographic entities: the union of the legally defined AIAN areas within the respective state and the remaining portion of the state.20 Counties, tracts, and block groups are also subdivided as necessary to be fully nested within either the AIAN branch or the non-AIAN branch of the state’s hierarchy. Note that AIAN areas, like all tabulation geographic entities, are composed of tabulation blocks; therefore, the creation of this spine did not require the census blocks to be subdivided.
In the 2010 Census geography definitions, there were 36 states with one or more AIAN areas on the spine; that number increased to 37 in 2020. Functionally, these new geographic units are treated as state equivalents in the TDA, meaning there were 87 state-equivalent geographies (88 in 2020) within the United States used in TDA.
Creating the AIAN spine introduced a subtle, but important, modification to the way the least squares and rounder problems were solved. In states with AIAN areas on the spine, the TDA hierarchy branches at the state level. This means that at the state level, the AIAN total population and the non-AIAN total population are estimated using privacy-loss budget; only their sum is invariant. For states without AIAN areas on the spine, total population is invariant and there is no need to estimate AIAN and non-AIAN components.
Off-spine entities are those geographic entities that are not directly estimated as part of the TDA spine. Using the standard census tabulation spine (United States, state, county, tract, block group, block), many legally defined off-spine entities are far away from the tabulation spine. The relevant distance measure is the minimum number of on-spine entities added to or subtracted from one another in order to define the off-spine entity, which we call the off-spine distance. There is an inverse relation between off-spine distance and count accuracy for off-spine entities. As an entity’s off-spine distance increases, the accuracy of its estimated characteristics decreases due to the accumulation of independent random noise from the DP mechanism applied to each of these components. Our approach to optimizing the geographic spine is discussed in detail in Abowd, Ashmead, Cumings-Menon, Kifer, et al. (2021). In this section, we provide the main details.
In order to improve the accuracy of legally defined off-spine entities, we designed an algorithm to reconfigure the standard tabulation spine for use within the TDA. This was done in two ways. First, we created custom block groups, which are regroupings of the tabulation blocks such that off-spine entities are closer to the spine and group quarters facilities of each type are isolated from their surrounding housing-unit-only areas. Isolating group quarters in this manner improves the accuracy of estimated characteristics in both the group quarters and the surrounding housing-unit-only blocks. Second, we bypass parent geographies with only a single child, combining the privacy-loss budget for the child with that of the parent, which avoids wasting privacy-loss budget on redundant measurements and yields more accurate estimates. The accuracy gain occurs because the algorithm proceeds in a top-down fashion; hence, in the case of a single child geography, all measurements must exactly equal those of the parent, and additional privacy-loss budget used on the child after the parental estimation is wasted. This optimization step effectively collapses the parent and child into a single node when the parent’s fan-out is one and uses the combined privacy-loss budget of the parent and child to take DP measurements once for this node.
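The single-child bypass can be sketched as a recursive tree transformation (the `(name, rho_share, children)` tuple representation is a hypothetical stand-in for the production spine data structures; budget shares are integers out of 1024, mirroring the allocation style used elsewhere in this article):

```python
# Collapse every parent with fan-out one into its child, pooling their
# privacy-loss budget shares, and return the optimized subtree.

def collapse_single_child_nodes(node):
    name, rho, children = node
    children = [collapse_single_child_nodes(c) for c in children]
    if len(children) == 1:
        # A single child's measurements would duplicate the parent's,
        # so merge the two nodes and combine their budgets.
        child_name, child_rho, grandchildren = children[0]
        return (child_name, rho + child_rho, grandchildren)
    return (name, rho, children)

spine = ("county", 78, [
    ("tract", 51, [
        ("block_group", 172, [("block_1", 260, []), ("block_2", 259, [])]),
    ]),
])
optimized = collapse_single_child_nodes(spine)
# The county -> tract -> block_group chain collapses into one node
# whose pooled budget share is 78 + 51 + 172 = 301.
assert optimized == ("block_group", 301,
                     [("block_1", 260, []), ("block_2", 259, [])])
```

The recursion works bottom-up, so chains of any length collapse into a single node carrying the combined budget.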
We applied spine optimization after defining the AIAN spine. Hence, the AIAN areas and the balance of the state were optimized in separate branches of the hierarchy. To implement spine optimization, the Geography Division provided a list of the major subcounty governmental entities within each state. For portions of strong Minor Civil Division (MCD) states outside of AIAN areas, these entities were MCDs. In all other areas, the major subcounty entities were a combination of incorporated and census-designated places. As with states, counties, and tracts using the AIAN spine, these substate entities were divided into the part within an AIAN area and the balance of the substate entity. We call these substate entities the targeted off-spine entities. We minimized the off-spine distance for the targeted off-spine entities using a greedy algorithm that created custom block groups to reduce the number of on-spine entities required to build the targeted off-spine entities. The custom block groups replaced tabulation block groups within TDA; however, the official tabulation block groups are used for all reported statistics. That is, custom block groups are an implementation detail for TDA, not a replacement for tabulation block groups. The spine optimization algorithm could also have created custom tract groups, but this feature was not implemented for the redistricting data.
Table 2 shows selected statistics for geographic entities at each level of the hierarchy after applying the spine optimization to the AIAN spine for 2010 and 2020, respectively. The number of entities for state, county, tract, and block group differ from the number of tabulation geographic entities because the AIAN spine subdivides geographies with AIAN areas. Effectively, this moves tabulation counties, tracts, and block groups slightly off spine in order to move AIAN areas, MCDs, and census places closer to the spine.
|  | 2010 | 2020 |
|---|---|---|
| States with AIAN areas on the Spine (FIPS Codes)* | 01, 02, 04, 06, 08, 09, 12, … | 01, 02, 04, 06, 08, 09, 12, … |
| Entities, State Level | 87 | 88 |
| Entities, County Level | 3,488 | 3,496 |
| Entities, Tract Level | 73,180 | 84,589 |
| Tabulation Block Groups | 217,740 | 239,780 |
| Custom Block Groups** | 402,020 | 409,548 |
| Tabulation Blocks | 11,078,297 | 8,132,968 |
| Tabulation Blocks with Occupied GQs | 112,956 | 126,723 |
| Tabulation Blocks with Housing Units | 6,379,963 | 5,879,049 |
| Tabulation Blocks with Potential Positive Population | 6,398,202 | 5,892,698 |
*Rhode Island has an American Indian and Alaska Native (AIAN) area in 2020; however, it is not included in the list of states with AIAN areas on the spine because there are no housing units or occupied Group Quarters (GQs) in this AIAN area. Therefore, the blocks within this AIAN area are not included in the spine at all because they have a zero probability, a priori, of containing positive resident population.
**Custom block groups (CBG) used within the TopDown Algorithm (TDA) differ from tabulation block groups. These differences improve accuracy for off-spine geographies like places and minor civil divisions. The use of CBG for measurement and postprocessing within TDA does not impact how the resulting data are tabulated. All 2020 Census data products will be tabulated using the official tabulation block groups as defined by the Census Bureau’s Geography Division.
The TDA implementation within the DAS consists of two primary parts: an underlying framework and the specialization of the framework for the implementation in the 2020 Census.
The DAS framework is a set of 22 Python source files (2,814 lines of code and 1,125 lines of unit tests) that provides a generic framework for implementing a batch data processing system. The framework provides explicit modules that read a configuration file and then use the information in that file to initialize the remainder of the application, read the input data (the reader), perform the differential privacy computation (the engine), validate the results of the computation (the validator), and write out the results (the writer). The actual reader, engine, validator, and writer are implemented by their own Python classes that are also specified in the configuration file. Each module can in turn read additional parameters from the configuration file. Thus, the behavior of the entire system is determined by the configuration file, making it easy to use the framework for both algorithmic experimentation and production by specifying different configuration files.
The TDA for the 2020 Census is written within the DAS framework. It consists of 461 Python source files (78,428 lines of program files and 11,513 lines of unit tests). The system includes 10 readers, 20 writers, and multiple engines. Only one reader, writer, and engine are used for any given run. To document the provenance of each TDA run, the DAS framework automatically determines from the configuration file the specific Python files required and saves this information in XML and in Portable Document Format (PDF) in the DAS Certificate. The system also creates an archive containing the complete source code used for the run. All TDA run data and metadata are archived in the AWS Simple Storage Service (S3) with vintage tags.
All DP mechanisms implemented in TDA require a source of random numbers. We estimate that the 2020 Census will require roughly 90 TB of random bytes to protect the person and housing unit tables. We generate random bits in the AWS environment using the Intel RDRAND instruction mixed with bits from /dev/urandom (Garfinkel & Leclerc, 2020). Because we used an implementation of the Canonne et al. (2020, 2021) exact sampling approach for the discrete Gaussian mechanism, and allocated privacy-loss budgets using rational numbers, our implementation avoids the vulnerabilities often associated with floating-point random numbers as noted by Mironov (2012).
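The general entropy-mixing idea can be sketched as follows. This is an illustration only, not the production implementation: Python cannot call RDRAND directly, so `os.urandom` stands in for the hardware source, and the SHA-256 combining step is an assumption about how one might mix two sources so that the output stays unpredictable unless both fail.

```python
import hashlib
import os

def mixed_random_bytes(n, hardware_source=os.urandom):
    """Produce n random bytes by hashing together blocks from a
    hardware generator (stand-in: os.urandom) and the OS entropy pool.
    An attacker must compromise both sources to predict the output."""
    out = bytearray()
    while len(out) < n:
        block = hashlib.sha256(hardware_source(32) + os.urandom(32)).digest()
        out.extend(block)
    return bytes(out[:n])

noise_seed = mixed_random_bytes(48)
assert len(noise_seed) == 48
```

At the census scale (roughly 90 TB of random bytes), throughput and exact integer sampling matter far more than this sketch suggests, which is why the production system uses the Canonne et al. exact discrete Gaussian sampler.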
The 2010 Census data represent the best source of real data for conducting utility experiments relevant to the 2020 Census without using the 2020 data. Starting in October 2019, the Census Bureau released a series of demonstration data products based on the 2010 data (U.S. Census Bureau, 2021d). Each set consisted of privacy-protected microdata files (PPMF) generated from the microdata detail file (MDF) that supported at least the redistricting data tables. When released, each demonstration data product represented the current state of DAS development.
It is not possible to tune a formally private disclosure avoidance system by directly using the confidential data it is designed to protect. Doing so causes the values of the tuning parameters—the privacy-loss budget allocations and the query strategies—to leak information about the confidential data. This is the reason that the TDA was not tuned using the 2020 Census Edited File. It is also the reason why quality assurance of the MDF is a difficult conceptual task. Historically, the disclosure avoidance methods used for the decennial census were tuned using data from the same census; record swaps were rejected if the resulting swapped data did not meet particular accuracy standards. For historical reasons and because of errors associated with the microdata requirement, quality assurance for the 2020 DAS MDF does not strictly adhere to the tenets of formal privacy. The MDF receives human review that examines it for values that look unusual, surprising, or unacceptable compared to the 2020 Census Edited File. Such findings may be escalated up the organizational hierarchy for review and decision-making. As noted in Section 10, no such errors were flagged in the review of the production redistricting data; however, this situation highlights the need for formally private techniques for quality assurance.
The demonstration data product dated April 28, 2021, was the final preproduction release. This PPMF consisted of person and housing-unit–level microdata with two global privacy-loss budgets: $\rho=1.095$ ($1.05$ for persons and $0.045$ for housing units) or $\epsilon = 11.14$, and $\rho=0.1885$ ($0.185$ for persons and $0.0035$ for housing units) or $\epsilon = 4.36$.21
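The quoted $\epsilon$ values are consistent with the standard conversion from zCDP to an $(\epsilon, \delta)$ guarantee, $\epsilon = \rho + 2\sqrt{\rho \ln(1/\delta)}$, assuming $\delta = 10^{-10}$ (the $\delta$ value is our assumption here; the source does not state it in this passage):

```python
import math

def zcdp_to_epsilon(rho, delta):
    """Convert a zCDP parameter rho to an (epsilon, delta) guarantee
    using the standard bound epsilon = rho + 2*sqrt(rho * ln(1/delta))."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

# With delta = 1e-10, this reproduces the epsilon values quoted above:
print(round(zcdp_to_epsilon(1.095, 1e-10), 2))   # 11.14
print(round(zcdp_to_epsilon(0.1885, 1e-10), 2))  # 4.36
```

Tighter conversions exist (e.g., via numerical optimization over the order of the Rényi divergence), but the closed-form bound already matches the published figures to two decimal places.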
Table 3 gives the proportion of the $\rho$ budget per geographic level for both persons and housing units in the April 2021 release. For the estimation of the persons tables, the block-level queries received approximately half of the total privacy-loss budget. For housing units, block-group-level queries received the predominant share of the privacy-loss budget.
| Geographic Level | Persons $\rho$ Proportions | Housing Units $\rho$ Proportions |
|---|---|---|
| U.S. | $\frac{51}{1024}$ | $\frac{1}{1024}$ |
| State | $\frac{153}{1024}$ | $\frac{1}{1024}$ |
| County | $\frac{78}{1024}$ | $\frac{18}{1024}$ |
| Tract | $\frac{51}{1024}$ | $\frac{75}{1024}$ |
| Custom Block Group* | $\frac{172}{1024}$ | $\frac{906}{1024}$ |
| Block | $\frac{519}{1024}$ | $\frac{23}{1024}$ |
*The custom block groups used within TDA differ from tabulation block groups.
Tables 4 and 5 show the proportions of the total $\rho$-budget allocations per query within each geographic level for the persons and housing units tables, respectively. These should be interpreted in conjunction with Table 3 in order to calculate the proportion of the total budget that a level-specific query represents. For example, the County TOTAL query represents $342/1024 \times 78/1024$, or approximately $2.5\%$, of the total privacy-loss budget for persons.22
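Because the allocations are specified as exact rationals, this arithmetic can be verified with Python's `fractions` module (the per-level shares below are the persons column of Table 3; the within-county proportion 342/1024 is the one cited in the example above):

```python
from fractions import Fraction

# Per-level rho proportions for persons (Table 3).
persons = {
    "US": Fraction(51, 1024), "State": Fraction(153, 1024),
    "County": Fraction(78, 1024), "Tract": Fraction(51, 1024),
    "Custom Block Group": Fraction(172, 1024), "Block": Fraction(519, 1024),
}
# The per-level shares partition the whole persons budget:
assert sum(persons.values()) == 1

# Proportion of the total persons budget held by the County TOTAL query:
county_total = Fraction(342, 1024) * persons["County"]
print(float(county_total))  # ~0.0254, i.e., approximately 2.5%
```

Exact rational arithmetic mirrors how TDA itself allocates privacy-loss budgets, avoiding floating-point accounting errors.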
The set of queries and the values used per geographic level and per query were determined by a set of internal experiments that took place over a period of approximately five months, beginning in December 2020 and ending in early April 2021. These internal experiments estimated the minimum global privacy-loss budget and its allocation to specific queries needed to ensure fitness-for-use in the redistricting application. See Wright and Irimata (2021) for details. These allocations were reflected in the April 2021 demonstration data product. Subsequent analysis by internal and external stakeholders addressed accuracy criteria for other, primarily demographic, uses. Those allocations are reflected in the final production settings.
The initial, systematic tuning phase specified a quantitative metric targeted at the redistricting use case: for at least 95% of geographic units, the largest of a selected set of CENRACE $\times$ HISPANIC populations, corresponding to the major race and ethnicity groups in Statistical Policy Directive 15, should not change by more than $\pm 5$ percentage points, when expressed as a percentage of the geographic unit’s total population, from its enumerated percentage of the total population in that geographic unit. Due to the computational constraints of creating many independent samples, geographic variability was used for evaluating the 95% criterion. While not a direct substitute for evaluating variability across many runs, we have observed that these metrics are very consistent from run to run, making them useful in their own right.
The TDA has error distributions that are data-dependent in a fashion that cannot be expressed in closed form and that depends on confidential data values that cannot be published. Therefore, we performed the fitting to the redistricting use case iteratively rather than relying on the ex ante properties of zCDP. We used six candidate query strategies (queries for which formally private measurements are taken), created by crossing two dimensions. First, query strategies either used measurements that closely conformed to those in the publication specifications for the redistricting data product, or collapsed/excluded certain marginal queries to try to take advantage of dependence among queries. Second, because both formal privacy and certain measures of error in TDA's estimation are especially sensitive to cells with small counts, we assigned budgets to queries in one of three ways: proportional to the square of query cardinality, inversely proportional to the square of query cardinality, or uniformly. Crossing these two dimensions generated the six candidate strategies. Optimization passes were ordered heuristically, with queries assigned large relative budgets typically isolated in their own passes. In the strategies that assigned almost all of the privacy-loss budget to the DETAILED query (proportional to the square of query cardinality), in particular, all queries were optimized simultaneously, in a single pass.
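The three weighting rules can be sketched as follows; the function and rule names are illustrative, not taken from the TDA codebase:

```python
# Sketch of the three budget-weighting rules described above, normalized
# so each strategy's per-query weights sum to 1; names are illustrative.
from fractions import Fraction

def query_weights(cardinalities, rule):
    if rule == "square":
        raw = {q: Fraction(c) ** 2 for q, c in cardinalities.items()}
    elif rule == "inverse_square":
        raw = {q: 1 / Fraction(c) ** 2 for q, c in cardinalities.items()}
    elif rule == "uniform":
        raw = {q: Fraction(1) for q in cardinalities}
    else:
        raise ValueError(rule)
    total = sum(raw.values())
    return {q: w / total for q, w in raw.items()}

# With the square rule, the 2,016-cell DETAILED query absorbs almost all
# of the weight, consistent with the single-pass behavior noted above.
w = query_weights({"TOTAL": 1, "VOTINGAGE": 2, "DETAILED": 2016}, "square")
```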
Strategies that assigned weight according to the inverse square of query cardinality were quickly ruled out as noncompetitive. For the remaining four strategies, we iterated experimentally, geographic level by geographic level, starting at the top with the United States. First, using binary search, we identified for each geographic level the approximate assignment of $\rho$ to that level's queries needed to satisfy the redistricting criterion when the other geographic levels were assigned infinite budget. We then proceeded to account for dependence between geographic levels. We set the United States and state levels to their single-level fitted $\rho$ values, assigned the remaining levels infinite budget, and increased the U.S. and state $\rho$ budgets proportionally until the criterion was satisfied. Next, we set the county budget to its single-level fitted $\rho$ and further increased the U.S., state, and county budgets proportionally until the criterion was satisfied. We proceeded down the hierarchy until all geographic levels had been assigned a $\rho$ value. This procedure led to the initial identification of a total $\rho$ of $1.095$ as satisfying the redistricting criterion. While it may not be obvious from this description, because the redistricting criterion requires statistics for off-spine entities, every experimental run of TDA processed data down to the block level. This is the reason tuning took several months to accomplish.
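A toy version of this fitting loop, with a stand-in for running TDA and evaluating the criterion, might look like the sketch below; all names and the scaling factor are assumptions for illustration:

```python
# Toy sketch of the tuning procedure described above; `meets_criterion`
# stands in for running TDA with the given per-level rho budgets (levels
# absent from the dict are treated as having infinite budget) and
# checking the redistricting metric. Not Bureau code.

def fit_single_level(meets_criterion, level, lo=1e-4, hi=16.0, iters=40):
    # Binary search for (approximately) the smallest rho at `level` that
    # satisfies the criterion when all other levels are unconstrained.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if meets_criterion({level: mid}):
            hi = mid
        else:
            lo = mid
    return hi

def fit_hierarchy(meets_criterion, levels):
    fitted = {}
    for level in levels:  # ordered top (U.S.) to bottom (block)
        fitted[level] = fit_single_level(meets_criterion, level)
        # Scale all already-fitted budgets up together until the joint
        # criterion holds, accounting for dependence between levels.
        while not meets_criterion(dict(fitted)):
            fitted = {lvl: 1.05 * r for lvl, r in fitted.items()}
    return fitted
```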
The April 2021 demonstration data product used the $L_2$ and $L_1$ multipass components. Specifically, at the U.S. level, we used a single pass for both the $L_2$ and $L_1$ phases, with all the queries in Table 4 included in both phases. The state, county, tract, and block group geographic levels each used two passes. For both the $L_2$ and $L_1$ phases, the first pass consisted of the TOTAL query, used as both the noisy estimate and a constraint; the second pass consisted of the remaining queries in Table 4. Finally, at the block level, a single pass with all the queries from Table 4 was used for both phases. The number of passes and the specific queries used in each pass were tested as part of the redistricting experiments, though we do not show comparisons with alternatives here.
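Under assumptions about naming, the pass structure just described can be written out as a configuration sketch:

```python
# Configuration sketch of the April 2021 pass structure described above;
# query names follow Table 4, and the dictionary layout is illustrative.
TABLE4_QUERIES = [
    "TOTAL", "CENRACE", "HISPANIC", "VOTINGAGE", "HHINSTLEVELS", "HHGQ",
    "HISPANIC x CENRACE", "VOTINGAGE x CENRACE", "VOTINGAGE x HISPANIC",
    "VOTINGAGE x HISPANIC x CENRACE",
    "HHGQ x VOTINGAGE x HISPANIC x CENRACE",
]
NON_TOTAL = [q for q in TABLE4_QUERIES if q != "TOTAL"]

# Each entry lists the queries used in each pass (same for L2 and L1).
PASSES = {
    "US":          [TABLE4_QUERIES],        # single pass, all queries
    "state":       [["TOTAL"], NON_TOTAL],  # TOTAL first: estimate and constraint
    "county":      [["TOTAL"], NON_TOTAL],
    "tract":       [["TOTAL"], NON_TOTAL],
    "block_group": [["TOTAL"], NON_TOTAL],
    "block":       [TABLE4_QUERIES],        # single pass, all queries
}
```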
Table 4. Per-query $\rho$ allocation proportions by geographic level (persons, April 2021).

| Query | U.S. | State | County | Tract | CBG* | Block |
|---|---|---|---|---|---|---|
| TOTAL (1 cell) |  | $\frac{678}{1024}$** | $\frac{342}{1024}$ | $\frac{1}{1024}$ | $\frac{527}{1024}$ | $\frac{1}{1024}$ |
| CENRACE (63 cells) | $\frac{2}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{2}{1024}$ | $\frac{1}{1024}$ | $\frac{2}{1024}$ |
| HISPANIC (2 cells) | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ |
| VOTINGAGE (2 cells) | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ |
| HHINSTLEVELS (3 cells) | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ |
| HHGQ (8 cells) | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ |
| HISPANIC × CENRACE (126 cells) | $\frac{5}{1024}$ | $\frac{2}{1024}$ | $\frac{3}{1024}$ | $\frac{5}{1024}$ | $\frac{3}{1024}$ | $\frac{5}{1024}$ |
| VOTINGAGE × CENRACE (126 cells) | $\frac{5}{1024}$ | $\frac{2}{1024}$ | $\frac{3}{1024}$ | $\frac{5}{1024}$ | $\frac{3}{1024}$ | $\frac{5}{1024}$ |
| VOTINGAGE × HISPANIC (4 cells) | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ | $\frac{1}{1024}$ |
| VOTINGAGE × HISPANIC × CENRACE (252 cells) |  |  |  |  |  |  |
| HHGQ × VOTINGAGE × HISPANIC × CENRACE (2,016 cells) |  |  |  |  |  |  |

^{*Custom block groups (CBG) differ from tabulation block groups and are only used by the TopDown Algorithm (TDA).}
^{**The TOTAL query (total population) is held invariant at the state level. The $\rho$ allocation assigned to TOTAL at the state level is the amount assigned to the state-level queries for the total population of all American Indian and Alaska Native (AIAN) tribal areas within the state and for the total population of the remainder of the state, for the 36 states that include AIAN tribal areas.}
Table 5. Per-query $\rho$ allocation proportions by geographic level (housing units, April 2021).

| Query | U.S. | State | County | Tract | CBG* | Block |
|---|---|---|---|---|---|---|
| Occupancy Status (2 cells) | 1 | 1 | 1 | 1 | 1 | 1 |

^{*Custom block groups (CBG) differ from tabulation block groups and are only used by the TopDown Algorithm (TDA).}
For the production run of the 2020 Census Redistricting Data (P.L. 94-171) Summary File, a total privacy-loss budget of $\rho = 2.63$ was used, with $\rho = 2.56$ for the person tables and $\rho = 0.07$ for the housing-unit tables, considerable increases over the April 2021 settings. Production settings of the geographic-level $\rho$ proportions are given in Table 6, and the query $\rho$ proportions by geographic level are given in Tables 7 and 8 for persons and housing units, respectively. Note, in particular, the shift in the $\rho$ allocation for the person-level data away from the block level and onto the custom block group and tract levels. This shift allowed the location confidentiality protection of respondents, which is primarily controlled by the allocation to the block level, to improve while meeting even tighter accuracy targets for HISPANIC $\times$ CENRACE at the tract and block-group levels. This reallocation relied on the efficiency gain from the spine optimization. The production settings used the same estimation strategy ($L_2$ and $L_1$ multipass components) and multipass orderings as the April 2021 release.
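For context, a $\rho$-zCDP guarantee can be converted to an $(\epsilon, \delta)$-DP guarantee using the standard bound from Bun and Steinke (2016), $\epsilon = \rho + 2\sqrt{\rho \ln(1/\delta)}$. The sketch below applies that bound to the production total at an illustrative $\delta$; this is not necessarily the conversion the Census Bureau used for its published $\epsilon$ values:

```python
import math

def zcdp_to_epsilon(rho, delta):
    # Standard zCDP -> (epsilon, delta)-DP conversion (Bun & Steinke, 2016):
    # a rho-zCDP mechanism satisfies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP.
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Production persons + housing-units budget, at an illustrative delta.
print(round(zcdp_to_epsilon(2.63, 1e-10), 2))  # 18.19
```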
Table 6. Production $\rho$ proportions by geographic level.

| Geographic Level | Person $\rho$ Proportions | Housing-Unit $\rho$ Proportions |
|---|---|---|
| U.S. | $\frac{104}{4099}$ | $\frac{1}{205}$ |
| State | $\frac{1440}{4099}$ | $\frac{1}{205}$ |
| County | $\frac{447}{4099}$ | $\frac{7}{82}$ |
| Tract | $\frac{687}{4099}$ | $\frac{346}{1025}$ |
| Custom Block Group* | $\frac{1256}{4099}$ | $\frac{1759}{4100}$ |
| Block | $\frac{165}{4099}$ | $\frac{99}{820}$ |

^{*Custom block groups differ from tabulation block groups and are only used by the TopDown Algorithm.}
Table 7. Per-query $\rho$ allocation proportions by geographic level (persons, production).

| Query | U.S. | State | County | Tract | CBG* | Block |
|---|---|---|---|---|---|---|
| TOTAL (1 cell) |  | $\frac{3773}{4097}$ | $\frac{3126}{4097}$ | $\frac{1567}{4102}$ | $\frac{1705}{4099}$ | $\frac{5}{4097}$ |
| CENRACE (63 cells) | $\frac{52}{4097}$ | $\frac{6}{4097}$ | $\frac{10}{4097}$ | $\frac{4}{2051}$ | $\frac{3}{4099}$ | $\frac{9}{4097}$ |
| HISPANIC (2 cells) | $\frac{26}{4097}$ | $\frac{6}{4097}$ | $\frac{10}{4097}$ | $\frac{5}{4102}$ | $\frac{3}{4099}$ | $\frac{5}{4097}$ |
| VOTINGAGE (2 cells) | $\frac{26}{4097}$ | $\frac{6}{4097}$ | $\frac{10}{4097}$ | $\frac{5}{4102}$ | $\frac{3}{4099}$ | $\frac{5}{4097}$ |
| HHINSTLEVELS (3 cells) | $\frac{26}{4097}$ | $\frac{6}{4097}$ | $\frac{10}{4097}$ | $\frac{5}{4102}$ | $\frac{3}{4099}$ | $\frac{5}{4097}$ |
| HHGQ (8 cells) | $\frac{26}{4097}$ | $\frac{6}{4097}$ | $\frac{10}{4097}$ | $\frac{5}{4102}$ | $\frac{3}{4099}$ | $\frac{5}{4097}$ |
| HISPANIC × CENRACE (126 cells) | $\frac{130}{4097}$ | $\frac{12}{4097}$ | $\frac{28}{4097}$ | $\frac{1933}{4102}$ | $\frac{1055}{4099}$ | $\frac{21}{4097}$ |
| VOTINGAGE × CENRACE (126 cells) | $\frac{130}{4097}$ | $\frac{12}{4097}$ | $\frac{28}{4097}$ | $\frac{10}{2051}$ | $\frac{9}{4099}$ | $\frac{21}{4097}$ |
| VOTINGAGE × HISPANIC (4 cells) | $\frac{26}{4097}$ | $\frac{6}{4097}$ | $\frac{10}{4097}$ | $\frac{5}{4102}$ | $\frac{3}{4099}$ | $\frac{5}{4097}$ |
| VOTINGAGE × HISPANIC × CENRACE (252 cells) | $\frac{26}{241}$ | $\frac{2}{241}$ |  |  |  |  |