
Assessing the Impact of Differential Privacy on Measures of Population and Racial Residential Segregation

Published on Jun 24, 2022

Abstract

The U.S. Census Bureau plans to use a new disclosure avoidance technique based on differential privacy to protect respondent confidentiality for the 2020 Decennial Census of Population and Housing. The new technique injects noise, governed by a set of parameters, into published statistics. While the noise injection does protect respondent confidentiality, it achieves that protection at the cost of less accurate data. To better understand the impact that differential privacy has on accuracy, we compare data from the complete-count 1940 Census with multiple differentially private versions of the same data set. We examine the absolute and relative accuracy of population counts in total and by race for multiple geographic levels, and we compare commonly used measures of residential segregation computed from these data sets. We find that accuracy varies by the global privacy-loss budget and the allocation of the privacy-loss budget to geographic levels (e.g., states, counties, enumeration districts) and queries. For measures of segregation, we observe situations where the differentially private data indicate less segregation than the original data and situations where they indicate more. The sensitivity of accuracy to the overall global privacy-loss budget and its allocation highlights the fundamental importance of these policy decisions. Data producers like the U.S. Census Bureau must collaborate with users not only to determine the most useful allocation of the privacy-loss budget, but also to provide documentation and tools for users to gauge the reliability and validity of statistics from publicly released data products. If they do not, producers may create statistics that are unusable or misleading for the wide variety of use cases that rely on them.

Keywords: differential privacy, residential segregation, accuracy, census


1. Introduction 

Publishing useful public data while protecting the confidentiality of respondents is a balancing act continually performed by the U.S. Census Bureau and other federal statistical agencies. From 1960 to 2018, the Census Bureau used traditional disclosure avoidance techniques such as swapping and cell suppression to protect confidentiality.1 In Fall 2018, the Census Bureau announced that they were abandoning these traditional techniques for data that would be published from the 2020 Decennial Census. Instead, they will use a new technique, differential privacy, to protect confidentiality (Abowd, 2018a, 2018b; U.S. Census Bureau, 2018). Differential privacy “marks a sea change for the way that official statistics are produced and published” (Garfinkel et al., 2018, p. 136).

The change from traditional disclosure avoidance techniques to differential privacy was motivated by a database reconstruction and reidentification attack undertaken by Census Bureau researchers (Abowd, 2016a, 2016b; Leclerc, 2019a). Using census block and tract summary tables from the 2010 Decennial Census, researchers reconstructed 308,745,538 census microdata records. Each record included an individual’s census block, age, sex, race, and an indicator for Hispanic ethnicity. Next, they executed a reidentification attack using an external, commercial microdata source that included a name, address, sex, and date of birth for each record. Forty-five percent of the 2010 Decennial microdata records putatively matched the commercial data set, and 38% of those putative matches were confirmed as exact matches. In other words, the attack successfully reidentified 17%—about one in six—of 2010 microdata records.

Given the results of this reidentification attack, as well as increases in computing power and massive growth in the volume of individual-level data collected by commercial entities, the Census Bureau concluded that traditional disclosure avoidance no longer sufficiently protected confidentiality, which the bureau is bound to do by law. Upholding confidentiality requirements would require new methods, and the Census Bureau decided to adopt differential privacy for disclosure avoidance.

Yet, we collectively know very little about how this decision will affect the accuracy of statistics and published data. Understanding these impacts is incredibly important, as census data are used for myriad purposes by innumerable stakeholders. For example, local city and county planners use census data to try to understand commuting and traffic patterns and provide first responder services. Businesses use the data to decide where to locate physical plants, considering both the availability of talent and potential markets (National Research Council, 1995). Social science researchers use the data to document racial and economic segregation or evaluate the employment and earnings impacts of changes to the minimum wage. The federal and state governments use the data to determine the annual allocation of up to $1.5 trillion of federal and state funds for programs ranging from Medicaid to highway construction and maintenance to K-12 education (Gordon, 2019; Hotchkiss & Phelan, 2017; Reamer, 2019). If the implementation of differential privacy changes the reliability or interpretation of released statistics and data, all of the above data uses, and many more, will have to change (Brummet et al., 2020; Ruggles et al., 2019).2

To assess the impact that the Census Bureau’s differential privacy algorithm has on decennial census data, in this article two teams of researchers, one from IPUMS at the University of Minnesota and the other from the W.E. Upjohn Institute for Employment Research, carry out two pairs of analyses that compare 100%, full-count census data across versions that have, and have not, had the bureau’s differential privacy algorithm applied.3 These analyses were conducted on the 1940 Census—the most recent census to be wholly released into the public domain.4

For the first exercise, the Upjohn team first graphically compares county population totals—overall and for Whites and Blacks, separately—across the different sets of data. This highlights how differential privacy may affect even the simplest use of the data—counts of people—at common sub-state levels of geography. The IPUMS team then computes mean absolute errors of population totals based on their own application of the census algorithm to the 1940 Census data, paying particular attention to how different privacy-loss budgets and different privacy-loss budget allocations—both discussed below—affect accuracy.

For the second exercise, the two teams compare commonly used racial segregation measures—including the index of dissimilarity (D), multigroup entropy (H), and isolation (B)—calculated from differentially private data with measures computed from the 1940 full-count microdata. These measures are commonly used by demographers and social scientists to understand how different groups of people cluster together at small levels of geography, and they can be sensitive to data inaccuracy.

Section 2 of our article provides a brief explanation and discussion of differential privacy. Sections 3 and 4 describe our data and methods. We present our results in section 5 and discuss them in section 6; section 7 concludes.

2. What Is Differential Privacy? 

The road to differential privacy began in 2003 when Irit Dinur and Kobbi Nissim developed an efficient mechanism to reconstruct individual-level data (i.e., microdata) from statistical tables (Dinur & Nissim, 2003). Building on that work, Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith developed differential privacy in a 2006 paper (Dwork et al., 2006). Differential privacy provides a mathematically provable lower bound on the amount of individual information leaked through the publication of statistical tables. The formal definition of differential privacy is (Equation 1):

$$\frac{\Pr[M(d) \in S]}{\Pr[M(d') \in S]} \leq e^{\varepsilon} \tag{1}$$

where M is a differentially private algorithm, d and d′ are neighboring databases that contain different values for one record (row) and the same values for all other records, S is a set of output from the algorithm, and ε is the parameter controlling the “degree of privacy offered by M” (Reiter, 2019).

An algorithm M is considered ε-differentially private if its outputs from d and d′ are similar, with similarity defined by e^ε. ε has an inverse relationship with the strength of the privacy protection offered by algorithm M: as the value of ε decreases, the strength of the privacy protection increases. Intuitively, as ε approaches 0, the algorithm M will yield the same output regardless of the input databases d and d′, so no marginal information is revealed, and privacy is protected. It also implies there must necessarily be bias, since the differences between d and d′ are no longer reflected in observable output. Conversely, as ε approaches infinity, the outputs will necessarily differ, and it is always possible to distinguish the inputs d and d′; no privacy has been imposed.

Moreover, as the set of outputs in S grows, the privacy-loss budget (PLB)—the amount of privacy loss one is willing to tolerate, as encapsulated by ε—must be allocated across all possible queries of M(d). Queries that receive little of this budget inherently privilege privacy over data accuracy. In other words, the PLB governs the trade-off between privacy (queries receive more random noise injection) and accuracy (queries receive less random noise injection).
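To make the role of ε concrete, the following sketch implements the classic Laplace mechanism for a single count query and then splits a budget across several queries. This is a generic illustration of ε-differential privacy, not the Census Bureau's DAS (which draws integer noise from a different distribution and applies extensive postprocessing); the function names and parameter values are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism: adding noise with scale sensitivity/epsilon to a
    count query (sensitivity 1, since one person changes a count by at most 1)
    satisfies epsilon-differential privacy."""
    return true_count + rng.laplace(scale=sensitivity / epsilon)

true_count = 120  # hypothetical enumeration-district count

# Smaller epsilon -> stronger privacy -> noisier published counts.
for eps in (0.25, 1.0, 8.0):
    draws = np.array([laplace_count(true_count, eps) for _ in range(10_000)])
    print(f"epsilon={eps:>4}: mean absolute error = {np.abs(draws - true_count).mean():.2f}")

# Sequential composition: splitting a global budget across k queries means
# each query runs at epsilon/k, so each individual answer gets noisier.
k, eps_global = 3, 1.0
per_query_answers = [laplace_count(true_count, eps_global / k) for _ in range(k)]
```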

This article focuses on the implementation and impacts of the Census Bureau’s differential privacy algorithm, known as the Disclosure Avoidance System (DAS), on statistics created from census data.5 The DAS implements the Census TopDown Algorithm to generate differentially private microdata, but imposes at least two additional constraints beyond standard differential privacy: (1) certain output statistics M(d), such as state population totals, must match the unperturbed data d exactly—such output statistics are called ‘invariants’; and (2) any differentially private individual-level microdata released must be consistent, when aggregated, with published statistics. Details about the TopDown Algorithm are available in Leclerc (2019b) and Abowd et al. (2019).

The TopDown Algorithm consists of a noise-injection step that generates differentially private counts for geographic units and a postprocessing step that converts the noisy counts into microdata. The noise-injection step takes a set of parameters—the overall or global PLB, the fractional allocation of the PLB to geographic levels, and the fractional allocation of the PLB to queries—and returns differentially private counts. These counts may contain negative values and will be inconsistent, both within and across geographic levels. Additionally, the constraint that certain statistics be invariant will not yet be satisfied.

The postprocessing step solves an optimization problem that minimizes differences among the noisy counts and enforces constraints such as nonnegativity, consistency, and invariants. This step returns differentially private ‘synthetic’ microdata that may be aggregated over geographic units. Aggregations from microdata will be consistent within and across geographic levels and cross-tabulations.

The accuracy of the aggregated counts is a function of noise-injection parameters and the postprocessing routine. With respect to the noise-injection parameters, for a given set of fractional allocations and constraints, accuracy improves as the global PLB increases. For a given global PLB and set of constraints, accuracy for a given geographic level (e.g., state, county, enumeration district) or query improves as the fractional allocation to the level or query increases. Unfortunately, improved accuracy for a given level or query comes at a cost. Since the global PLB is fixed, accuracy for other levels or queries will decrease.

With respect to postprocessing, errors are deliberately introduced in the optimization procedure to produce microdata that satisfy all required constraints. The optimization solver uses nonnegative least squares to generate the best-fitting nonnegative set of counts. This solution yields positive bias for small counts and negative bias for large counts and, according to the Census Bureau, is the source of a large share of discrepancies between DAS-produced population counts and those produced from previous disclosure avoidance methods (Gong, 2020; Hotz & Salvo, 2020; Sexton, 2019).6 Nonetheless, relative to the noise injection of differential privacy, considerably less is known about the theoretical or empirical properties of the postprocessing procedure.
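The small-count bias can be illustrated with a toy simulation. The sketch below substitutes a simple clamp at zero for the DAS's constrained optimization, so it is only a caricature of the real postprocessing; in the actual system, consistency constraints that force cells to sum to (near-)invariant totals are what transfer the excess mass out of large counts, producing the corresponding negative bias.

```python
import numpy as np

rng = np.random.default_rng(1)
noise = rng.laplace(scale=1 / 0.25, size=100_000)  # unbiased noise, mean ~ 0

for true_count in (2, 10_000):
    # Crude stand-in for nonnegativity postprocessing: clamp at zero.
    clamped = np.maximum(true_count + noise, 0)
    print(true_count, round(clamped.mean() - true_count, 3))

# The count of 2 is biased upward because draws below zero are pulled up to
# zero; the count of 10,000 is essentially unbiased here because the noise
# almost never pushes it negative.
```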

If the ‘error’ from the TopDown Algorithm were solely a function of the unbiased noise injected into the counts, it would be possible to estimate error bounds for the counts, as parameters controlling the shape of the noise distribution are known. Unfortunately, the error is a function of both unbiased noise injection and postprocessing, and there is no straightforward method to account for both sources of error.

3. Data

For our analyses, we use four data sets: one representing ‘ground truth’ microdata, and three that are differentially private. Our ground truth data set is the 1940 full-count census microdata, recently released in machine-readable format by IPUMS after a large effort to digitize scanned copies (PDFs) of the original enumeration forms, which the Census Bureau released in 2012 (Ruggles et al., 2018). This full-count data set also serves as the input (d) for each differentially private data set.

Our ground truth represents the raw census data as collected by enumerators (and subsequently digitized). No disclosure avoidance techniques have been applied to the ground truth data set. We compare the output from the DAS with these raw census counts to assess the impact of the noise injection and postprocessing on the accuracy of population counts and segregation indices. Comparing the 1940 full-count data to the differentially private outputs provides insights into the accuracy we may expect in the 2020 Decennial Census data. We readily acknowledge that the Census Bureau has long used other disclosure avoidance techniques, such as cell suppression or record swapping, to protect respondent privacy, and that these techniques also introduce error into population counts and other statistics. However, we are unable here to assess the impact of these techniques on accuracy. The 72-year rule prohibits access to original, unedited census data within the past 72 years, so the 1940 data are the most recent machine-readable full-count release.7 However, McKenna (2018, p. 11) states that “a confidential research study found that the impact of introducing [disclosure avoidance] error into the estimates was much smaller than errors from sampling, non-response, editing, and imputation.” Nonetheless, we wish to make explicit that our empirical exercise contrasts statistics produced from originally enumerated data with those produced by the Census Bureau’s modern DAS, not with those produced by previous and current disclosure avoidance techniques.8

The three differentially private data sets were created by applying the Census Bureau’s DAS to the 1940 full-count census microdata. These differentially private data sets do not include every field or variable from the original microdata file. Rather, as a testbed, only certain fields were processed in the DAS: geography, group quarters or household status, race, Hispanic status, and whether a record was of voting age (18-plus) or not.9 While the bureau released versions of the differentially private data files on their website,10 they also published the DAS source code in a GitHub repository, with instructions for installing and running the code on an Amazon Web Services EC2 instance (2020 Census DAS Development Team, 2019).

The DAS code allows users to allocate the privacy-loss budget to four geographic levels (nation, state, county, and enumeration district) and three queries (detailed, voting age-Hispanic-race, and household-group quarters). The fractional allocations to the geographic levels must sum to one, and the fractional allocations to the queries must also sum to one.11

The four geographic levels exhaustively subdivide the entire United States and are fully nested in one another. In other words, each part of the United States belongs to the nation, a single state, a single county, and a single enumeration district. Enumeration districts (groups of blocks) nest within counties, which nest within states, which then nest within the nation.

Queries are essentially cross-tabulations. The detailed query is created by cross-tabulating the variables for voting age, Hispanic status, race, and household/group quarters type. For example, one could query the number of voting-age, Black, non-Hispanics living in households. The voting age-Hispanic-race query is similar but does not separate by household-group-quarters type. The household-group quarters query simply differentiates between household and group-quarters type. Categories for these microdata variables are listed in Table 1.

Table 1. Microdata variables and their categories used in the census DAS queries.

| Variable | Categories |
| --- | --- |
| Voting age [2] | Under 18 years of age; 18 years of age or older |
| Hispanic [2] | Not Hispanic; Hispanic |
| Race [6] | White; African American; American Indian/Alaska Native; Chinese; Japanese; Other Asian or Pacific Islander |
| Household/group quarters [8] | Household; Correctional institutions; Mental institutions; Institutions for the elderly, handicapped, and poor; Military; College dormitory; Rooming house; Other non-institution GQ and unknown |

Note. These microdata variables undergo disclosure avoidance in the Census Disclosure Avoidance System (DAS). DAS output variable names differ from the ones listed in the table. Readers are directed to the 2018 End-to-End Test Disclosure Avoidance System Design Specification, version 1.2.8 (U.S. Census Bureau, 2019), for more details about the output variable names and code. The values in the brackets indicate the number of categories for the variables.

The detailed query consists of 192 cells: two voting age categories by two Hispanic categories by six race categories by eight household/group quarters categories. The voting age-Hispanic-race query consists of 24 cells: two voting age categories by two Hispanic categories by six race categories. The household-group quarters query consists of eight cells: eight household/group quarters categories.
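The three queries are related as marginals of the detailed cross-tabulation. The sketch below illustrates this with a synthetic numpy array; the axis ordering and the counts are hypothetical, chosen only to match the category counts in Table 1.

```python
import numpy as np

# Hypothetical detailed histogram for one enumeration district:
# axes are (voting age=2, Hispanic=2, race=6, household/GQ=8) -> 192 cells.
rng = np.random.default_rng(2)
detailed = rng.integers(0, 20, size=(2, 2, 6, 8))

# The voting age-Hispanic-race query is the marginal over the GQ axis (24 cells).
vhr = detailed.sum(axis=3)

# The household-group quarters query is the marginal over the other axes (8 cells).
hhgq = detailed.sum(axis=(0, 1, 2))

assert detailed.size == 192 and vhr.size == 24 and hhgq.size == 8
```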

The Upjohn team used three of the differentially private files produced by the Census Bureau’s own application of the DAS, focusing for simplicity on those with privacy-loss budgets (ε) of 0.25, 1.0, and 8.0.

In addition to using the differentially private files created by the Census Bureau, the IPUMS team created two additional sets of differentially private files by executing the DAS on an Amazon Web Services EC2 instance. The first set, consisting of four data files, was based on a fixed PLB (ε) of 1.0 and had the same default fractional allocations across queries as the Census Bureau-produced files. The geographic-level fractional allocation, however, varied for each data file (Table 2c). For example, the enumeration district run was based on an allocation of [0.85, 0.05, 0.05, 0.05] to the enumeration district, county, state, and nation, respectively. This relative allocation was repeated for each subsequent geographic level, yielding four data files.

Table 2. Fractional allocation of the privacy-loss budget (PLB) for three differentially private data sets.

a. Census DAS

| Geographic level | Fraction |
| --- | --- |
| Nation | 0.25 |
| State | 0.25 |
| County | 0.25 |
| Enumeration district | 0.25 |

| Query | Fraction |
| --- | --- |
| Voting age–Hispanic–Race (VHR) | 0.675 |
| Household–Group quarters (HHGQ) | 0.225 |
| Detailed (Detail) | 0.100 |

b. IPUMS Query (global PLB = 1.0)

| Query | VHR | HHGQ | Detail |
| --- | --- | --- | --- |
| Voting age–Hispanic–Race (VHR) | 0.90 | 0.05 | 0.05 |
| Household–Group quarters (HHGQ) | 0.05 | 0.90 | 0.05 |
| Detailed (Detail) | 0.05 | 0.05 | 0.90 |

| Geographic level | VHR | HHGQ | Detail |
| --- | --- | --- | --- |
| Nation | 0.25 | 0.25 | 0.25 |
| State | 0.25 | 0.25 | 0.25 |
| County | 0.25 | 0.25 | 0.25 |
| Enumeration district | 0.25 | 0.25 | 0.25 |

c. IPUMS Geographic level (global PLB = 1.0)

| Geographic level | Nation | State | County | ED |
| --- | --- | --- | --- | --- |
| Nation | 0.85 | 0.05 | 0.05 | 0.05 |
| State | 0.05 | 0.85 | 0.05 | 0.05 |
| County | 0.05 | 0.05 | 0.85 | 0.05 |
| Enumeration district | 0.05 | 0.05 | 0.05 | 0.85 |

| Query | Nation | State | County | ED |
| --- | --- | --- | --- | --- |
| Voting age–Hispanic–Race (VHR) | 0.675 | 0.675 | 0.675 | 0.675 |
| Household–Group quarters (HHGQ) | 0.225 | 0.225 | 0.225 | 0.225 |
| Detailed (Detail) | 0.100 | 0.100 | 0.100 | 0.100 |

Note. Panels a-c show the fractional allocation of the PLB to geographic levels and queries for the three differentially private data sets used in the article. Panel a is the allocation used for the Census Disclosure Avoidance System (DAS) data sets published by the U.S. Census Bureau in June 2019. Panel b is the allocation used for the IPUMS query data set. We assign 0.90 of the PLB to one query and 0.05 to the remaining queries. We then repeat the pattern for the remaining two permutations. Panel c is the allocation used for the IPUMS geographic-level data set. We assign 0.85 of the PLB to one geographic level and 0.05 to the remaining geographic levels. We then repeat the pattern for the remaining three permutations. The Census DAS data set used the allocation depicted in Panel a for eight different global PLBs—0.25, 0.50, 0.75, 1.0, 2.0, 4.0, 6.0, and 8.0. For the two IPUMS data sets, each column in a given panel represents the parameters used in a given run of the DAS. Column headers are the names we assigned to each run.

The second data set, consisting of three data files, was based on a fixed PLB (ε) of 1.0 and an equal allocation of 0.25 to each of the four geographic levels. The query fractional allocation, however, varied for each data file (Table 2b). For example, the detail run was based on an allocation of [0.90, 0.05, 0.05] to the detailed, voting age by Hispanic origin by race, and household/group quarters queries, respectively. This relative allocation was repeated for each of the other query types, yielding three data files.

All data files were analyzed in Stata and R.12

4. Methods

Our approaches are straightforward. In the first set of analyses, we compare population counts by geographic area, overall and for racial groups, across data sets. Understanding how differential privacy affects something as fundamental—and as highly visible—as population is of first-order importance. Accuracy in counting people—the very thing the census is designed to do—is critical. Less-populated areas or relatively small racial groups, for which even small errors can have large consequences, are of particular interest.

In the second set of analyses, we construct indices of segregation, which measure how one population group is distributed relative to another across space, and compare these indices between the full-count data and differentially private versions thereof. These measures are important not only because they document social structures such as racial and economic segregation, but because their construction involves functions of several population estimates, and the PLB allocation is not currently designed to account for output from these kinds of functions. Segregation indices thus may illustrate how differential privacy affects estimates of complex statistics that the DAS algorithm was not designed to preserve but that are still used for research and public policy.

4.1. Population Differences 

Differential privacy’s noise injection mechanism can produce counts of populations—especially for subgroups—at all levels of the geographic hierarchy that differ from those observed in the original data. The magnitude of these differences is influenced by the PLB and its fractional allocation to geographic levels and queries.

We start our analysis with the default PLB allocations of the differentially private files provided by the Census Bureau; these allocations are given in Table 2a. To calculate population differences between the IPUMS full-count file and the differentially private files, we adopt two complementary approaches. First, we measure relative differences by taking natural logarithms of the population counts from both the differentially private file and the IPUMS full-count file, and then subtracting the former from the latter. This metric (when scaled by 100) approximates the percentage difference between the two measures.13 This approach allows comparisons across geographies of different size. However, it is not strictly additive across those geographies. Thus, in the second approach we measure the mean absolute error: the mean, across geographic units, of the absolute value of the difference between the two population counts. Both approaches are applied to total population counts as well as separately to those for White and African American populations. While our focus is on county-level counts (which lend themselves to visual maps), we also provide estimates for other geographies in tabular format.
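In code, the two metrics reduce to a few lines; the sketch below uses a hypothetical data frame with column names of our own choosing, not the actual files.

```python
import numpy as np
import pandas as pd

# Hypothetical county-level counts: 'pop_ipums' from the full-count file,
# 'pop_das' from a differentially private file.
df = pd.DataFrame({
    "county": ["A", "B", "C"],
    "pop_ipums": [18_679, 1_200, 250_000],
    "pop_das": [18_702, 1_154, 249_991],
})

# Relative difference: log(IPUMS) - log(DAS); multiplied by 100 this
# approximates the percentage difference between the two counts.
df["log_diff"] = np.log(df["pop_ipums"]) - np.log(df["pop_das"])

# Mean absolute error across counties.
mae = (df["pop_ipums"] - df["pop_das"]).abs().mean()
```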

Furthermore, to better understand how the fractional allocation of PLB affects differences in population counts, we also calculate the mean absolute error for the total, White, and African American populations for the differentially private data sets created by IPUMS, in which the PLB allocations vary, as described above (see also Table 2b and 2c for details). 

4.2. Segregation

For geographies with few people from a given subgroup, differential privacy’s noise injection may create substantial relative bias, as a small absolute error is magnified when the count in the denominator is also small. Statistics based on estimates from multiple subgroups may be particularly susceptible to these biases.14 To assess the impact of differential privacy on such derived statistics, we calculate three commonly used segregation indices—the index of dissimilarity (D), multigroup entropy (H), and isolation (B)—using both the IPUMS full-count data and the differentially private data sets. For simplicity, we focus on index values at the county level. Results for multigroup entropy and isolation are similar to those for the index of dissimilarity; therefore, to save space we focus here on D, but our description and discussion of the multigroup entropy and isolation indices can be found in the Appendix.

Index of Dissimilarity. The primary measure of segregation we use is the index of dissimilarity. This measure, abbreviated D, compares the distribution of two population subgroups among a set of smaller geographies nested in a larger geography (Iceland et al., 2002; Massey & Denton, 1988). In our setting, the smaller geographies are enumeration districts, which nest within a county, and the two population subgroups are Whites and African Americans.15 D is calculated as:

$$D = \frac{1}{2}\sum_{i=1}^{n}\left|\frac{w_i}{w_{cty}} - \frac{b_i}{b_{cty}}\right| \tag{2}$$

where n is the number of enumeration districts in a county, w_i is the White population in enumeration district i, w_cty is the White population in the county, b_i is the African American, or Black, population in enumeration district i, and b_cty is the Black population in the county.

Values of D may vary from 0 to 1. Dissimilarity is minimized (value equals 0) when the proportions of the subgroups in each enumeration district match those of the county. Dissimilarity is maximized when all enumeration districts contain only a single population subgroup. Intuitively, D gives the fraction of one group that would need to move to a different enumeration district for each district’s composition to match the county’s distribution.

D is one of the most widely used measures of segregation (Frey & Myers, 2005), but it is not without issues. Notably, D can be very sensitive to small changes, and measurement error in low-population places can cause upward bias (Napierala & Denton, 2017). A key problem occurs when one of the subgroups is completely absent from the county; by convention, D takes the value of 0 in this case, but if a single person from that subgroup moves in (or measurement error falsely assigns one person of that subgroup to a district), the index can flip from 0 to nearly 1. Likewise, in a county that is (near-) perfectly segregated across districts, moving (or misreporting) one person into a district that previously had no members of that group can substantively shift the index downward. Thus, the potential for bias is high for small-population places if differential privacy imposes even slight amounts of measurement error.
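A direct implementation of Equation 2 is short. The sketch below assumes a table of enumeration-district counts with hypothetical column names and applies the convention that D = 0 when either group is absent from the county.

```python
import pandas as pd

def dissimilarity(ed: pd.DataFrame) -> float:
    """Index of dissimilarity (Equation 2) for one county.

    `ed` holds one row per enumeration district, with hypothetical columns
    'white' and 'black'. By convention, D = 0 when either group is absent
    from the county."""
    w_cty, b_cty = ed["white"].sum(), ed["black"].sum()
    if w_cty == 0 or b_cty == 0:
        return 0.0
    return 0.5 * (ed["white"] / w_cty - ed["black"] / b_cty).abs().sum()

# For a table with columns ['county', 'ed', 'white', 'black']:
# d_by_county = df.groupby("county").apply(dissimilarity)
```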

5. Results

Our results section consists of three parts. The first examines the accuracy of county-level population counts, using relative (log) differences, for various values of ε under the default PLB allocations (Table 2a). We consider the accuracy of three counts: total, White, and African American. The second part examines the accuracy of nation, state, county, and enumeration district counts, using mean absolute error, for three differentially private data sets—the data released by the Census Bureau and two data sets generated by IPUMS. The third part examines differences in the index of dissimilarity between the original complete-count 1940 data and the three differentially private data sets.

5.1. Accuracy in Population Counts for Counties Under Default PLB Allocations

We begin the comparisons of population estimates between the IPUMS full-count and differentially private files by graphically examining county-level relative (log) differences, using the differentially private files produced by the Census Bureau, with global PLB (ε) values of 0.25, 1.0, and 8.0, as well as the default PLB allocations given in Table 2a.

Figure 1a shows maps of these county-level differences in total population for the three values of ε. When this parameter is 8.0, the global PLB is relatively large, and the map in the lower left shows that only a handful of counties—out of roughly 3,000—have a population discrepancy of more than 0.5%. In many cases, the discrepancy is even smaller—closer to 0.1% or less. However, as ε shrinks, differences in total population estimates become more common and more sizable. The map in the upper right of Figure 1a shows a mix of over- and underestimates in the differentially private data when ε = 1, mostly in the sparsely populated West. As ε falls to 0.25, as in the upper left panel, population discrepancies become common, and many counties have differences of 5% or more.


Figure 1. Differences in Log Population between IPUMS full-count and Disclosure Avoidance System (DAS) data files, by county, 1940 Census. The figures display the difference in the (natural) log county-level population between the IPUMS 1940 Census full-count file and three differentially private files produced by the Census Bureau—those with ε values of 0.25 (upper left), 1.0 (upper right), and 8.0 (lower left). When multiplied by 100, these differences are approximately percentage differences. Counties in blue indicate population estimates from the IPUMS full-count file are greater than in the differentially private files; counties in red indicate the reverse. Figure a shows all counties in the United States, and figure b shows counties with a population (in the IPUMS full-count file) over 18,000 (according to IPUMS, this is roughly the median county population in 1940).

Of course, differential privacy’s noise infusion works in absolute numbers, and relative population differences will be concentrated in less-populated counties, where even small discrepancies in counts can meaningfully affect percentages. This can be seen more clearly in Figure 1b, which shows only counties with a total population of 18,000 or more (according to the IPUMS full-count file)—about the median county size in 1940. For these counties—the most populous half—the maps are nearly empty when ε is at least 1, indicating only very minor population discrepancies between the IPUMS full-count and differential privacy files. When ε is as low as 0.25, though, many counties continue to show population differences of at least 1%, or roughly 180 people for the median-size county—a nontrivial amount.

Further, because the privacy-loss budget must be split across queries, population estimates for subgroups—Whites and African Americans—may be even more prone to error. Figures 2a and 2b and 3a and 3b repeat Figures 1a and 1b, except that they represent differences in population estimates for Whites and African Americans, respectively. Figure 2a shows that population estimates for Whites from the differentially private data files are reasonably close to those from the IPUMS file for most counties, but underestimates (in blue) are still more common—even for ε = 8—than for total population estimates. Indeed, even for counties with above-median population, shown in Figure 2b, several counties in the South show an underestimate of 1% or more. As with Figures 1a and 1b, the discrepancies grow larger and more frequent as ε declines. The picture changes dramatically, however, with estimates of the African American population in Figure 3a. For this group, population estimates vary considerably for a majority of counties, even when ε = 8 and, outside the Deep South, even for counties with above-median population (Figure 3b).


Figure 2. Differences in Log White Population between IPUMS full-count and Disclosure Avoidance System (DAS) data files, by county, 1940 Census. The figures display the difference in the (natural) log county-level population of Whites between the IPUMS 1940 Census full-count file and three differentially private files produced by the Census Bureau—those with ε values of 0.25 (upper left), 1.0 (upper right), and 8.0 (lower left). Figure a shows all counties in the United States, and figure b shows counties with a population (in the IPUMS full-count file) over 18,000 (according to IPUMS, this is roughly the median county population in 1940). See note to Figure 1 for details about the map legend.


Figure 3. Differences in Log African American Population between IPUMS full-count and Disclosure Avoidance System (DAS) data files, by county, 1940 Census. The figures display the difference in the (natural) log county-level population of African Americans between the IPUMS 1940 Census full-count file and three differentially private files produced by the Census Bureau—those with ε values of 0.25 (upper left), 1.0 (upper right), and 8.0 (lower left). Figure a shows all counties in the United States, and figure b shows counties with a population (in the IPUMS full-count file) over 18,000 (according to IPUMS, this is roughly the median county population in 1940). See note to Figure 1 for details about the map legend.

Table 3 quantifies these differences for the total population, the population of Whites, and the population of African Americans. For total population, the average difference across counties between the IPUMS full-count file and the differentially private file is 0.22% when ε = 0.25 and a trivial 0.01% when ε = 8.16 However, extreme cases are sensitive to ε: the bottom and top 1% of county differences—about 30 counties on either end—have discrepancies of at least 3.8–6.4% when ε = 0.25, but only 0.2–0.3% when ε = 8. Differences in the estimates of the White population show roughly similar dispersion, with the bottom and top 1% of county differences on the order of 5% when ε = 0.25, and generally less than 1% when ε = 8. (Expressed differently, and not shown in the table, the share of counties with a discrepancy of at least 1% falls from 19.2% when ε = 0.25 to 1.3% when ε = 8.) For African Americans, population estimates between the data sources converge much more slowly as ε increases, with the mean difference falling from a 16.0% overcount in the differentially private file relative to IPUMS full-count when ε = 0.25, to a 3.3% undercount when ε = 8. Moreover, extreme cases are troubling throughout the range of ε, with differences of several multiples in small counties with few African Americans. Even in counties with African American population above the median (not shown in the table), differences in population estimates of at least 1% occur in 5.4% of counties when ε = 8 and 58.2% of counties when ε = 0.25.

Table 3. Distribution of county-level relative (log) population differences between IPUMS full-count and differentially private data files.

| | M | SD | P1 | P50 | P99 |
| --- | --- | --- | --- | --- | --- |
| Total Pop: ε = 0.25 | -0.0022 | 0.0285 | -0.0661 | -0.0001 | 0.0375 |
| Total Pop: ε = 1.0 | -0.0007 | 0.0094 | -0.0195 | -0.0001 | 0.0099 |
| Total Pop: ε = 8.0 | -0.0001 | 0.0029 | -0.0027 | 0.0000 | 0.0015 |
| White Pop: ε = 0.25 | 0.0011 | 0.0374 | -0.0451 | 0.0002 | 0.0549 |
| White Pop: ε = 1.0 | 0.0009 | 0.0213 | -0.0105 | | 0.0167 |
| White Pop: ε = 8.0 | 0.0004 | 0.0049 | -0.0021 | 0.0001 | 0.0104 |
| African American Pop: ε = 0.25 | -0.1620 | 0.8940 | -3.8430 | -0.0002 | 2.0788 |
| African American Pop: ε = 1.0 | -0.0878 | 0.6293 | -2.8090 | 0.0000 | 1.6548 |
| African American Pop: ε = 8.0 | -0.0337 | 0.3348 | -1.3860 | 0.0000 | 0.7168 |

Note. The table shows selected characteristics (mean, standard deviation, 1st percentile, 50th percentile, and 99th percentile) of relative (log) population differences between the IPUMS full-count 1940 Census file and three differentially private files produced by the Census Bureau—those with ε values of 0.25, 1.0, and 8.0. See note to Figure 1a.

5.2. Accuracy in Population Counts for Geographies Under Varying PLB Allocations

The preceding section examined relative differences between the IPUMS full-count data and the Census Bureau’s differentially private data set. We next consider absolute differences between the IPUMS full-count and the three differentially private data sets described in Table 2. Since the noise infusion and postprocessing mechanisms in the Census Bureau’s differentially private algorithm add or subtract whole numbers from the precise counts, absolute differences provide insights into how the counts differ following the disclosure avoidance procedure.

Here, we assess accuracy by comparing nation-, state-, county-, and enumeration district-level counts of the total, White, and African American population from the differentially private data sets with the IPUMS full-count data. We compute the counts from all data sets for the nation, 50 states and the District of Columbia, 3,108 counties, and 134,857 enumeration districts (EDs) in the 1940 census. If a particular subgroup was not found in a given county or ED in a data set (in either the differentially private data sets or the IPUMS full-count data set), we add a record to the tabulation and set its value to zero. We then take the absolute value of the difference between the differentially private and IPUMS counts. Finally, we compute the mean absolute error for each geographic level.
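The zero-fill step matters because a subgroup may appear in one source but not the other; an outer join makes those implicit zeros explicit before differencing. The following sketch, with hypothetical column names rather than the team's actual tabulation code, shows one way to do this.

```python
import pandas as pd

def mae_for_level(full: pd.DataFrame, dp: pd.DataFrame, keys: list[str]) -> float:
    """Mean absolute error at one geographic level.

    `full` and `dp` each have the geographic identifier columns in `keys`
    plus a 'count' column (hypothetical names). The outer merge inserts a
    row wherever a unit/subgroup appears in only one source; fillna then
    supplies the implicit zero count."""
    merged = full.merge(dp, on=keys, how="outer", suffixes=("_full", "_dp"))
    merged = merged.fillna({"count_full": 0, "count_dp": 0})
    return (merged["count_full"] - merged["count_dp"]).abs().mean()
```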

Table 4 summarizes the differences between the data sets for total population. For the Census Bureau’s differentially private data set (4a), the mean absolute error for the nation and state levels is the same for all PLB values. Mean absolute error for counties and EDs decreases as the PLB increases. Examining the results for the data set where we vary the fractional allocation by geographic level (4b), the mean absolute error for the nation and state levels is again the same for all allocations. Mean absolute errors for counties and EDs are lowest when those geographic levels receive the largest fraction of the PLB. Finally, for the data set where we vary the fractional allocation by query (4c), mean absolute errors for counties and EDs are lowest when the household–group quarters query receives the largest fraction of the PLB. 

Table 4. Mean absolute errors in total population counts between IPUMS full-count and differentially private data files, by geography and privacy-loss budget (PLB) allocation.

| | Nation | State | County | ED |
| --- | --- | --- | --- | --- |
| a. DAS (global PLB) | | | | |
| 0.25 | 1 | 2 | 105.5 | 71.0 |
| 0.50 | 1 | 2 | 57.2 | 37.5 |
| 0.75 | 1 | 2 | 39.2 | 25.9 |
| 1.0 | 1 | 2 | 31.2 | 20.0 |
| 2.0 | 1 | 2 | 16.7 | 11.0 |
| 4.0 | 1 | 2 | 9.3 | 6.5 |
| 6.0 | 1 | 2 | 6.9 | 5.1 |
| 8.0 | 1 | 2 | 5.7 | 4.4 |
| b. Geog. levels (global PLB = 1.0) | | | | |
| Nation | 1 | 2 | 130.9 | 87.2 |
| State | 1 | 2 | 126.4 | 86.9 |
| County | 1 | 2 | 13.4 | 86.6 |
| Enumeration district | 1 | 2 | 133.4 | 9.4 |
| c. Queries (global PLB = 1.0) | | | | |
| Detailed | 1 | 2 | 35.7 | 18.3 |
| Voting age–Hispanic–Race | 1 | 2 | 29.4 | 18.3 |
| Household–Group quarters | 1 | 2 | 14.7 | 9.8 |
| Mean population | 132,404,766 | 2,596,172 | 42,601 | 982 |
| Median population | 132,404,766 | 1,903,131 | 18,679 | 865 |

Note. Panel a uses differentially private data files released by the Census Bureau, and rows reflect the global PLB. Panels b and c use differentially private data files created by IPUMS using the Disclosure Avoidance System (DAS) procedure, with a global PLB of 1.0 and allocations as shown in Table 2. ED = enumeration district. Mean and median total populations were calculated from the IPUMS full-count microdata.

Table 5 summarizes the differences between the data sets for the White and African American populations. For the Census Bureau’s differentially private data set (5a), the patterns for the White and African American populations mirror those for total population. The mean absolute error for Whites and African Americans decreases as the PLB increases for states, counties, and EDs. For the geographic-level fractional allocation data set (5b), mean absolute error is lowest when a given geographic level receives the largest fraction of the PLB. Finally, for the query fractional allocation data set (5c), mean absolute error is highest for all geographic levels when the household–group quarters table receives the largest fraction of the PLB. For the nation, state, and county levels, mean absolute error is lowest when voting age–Hispanic–race receives the largest fraction. The mean absolute error for EDs is the same for the voting age–Hispanic–race and detailed queries.

Table 5. Mean absolute errors in population counts between IPUMS full-count and differentially private data files, by race, geography, and privacy-loss budget (PLB) allocation.

| | White: Nation | White: State | White: County | White: ED | Af. Am.: Nation | Af. Am.: State | Af. Am.: County | Af. Am.: ED |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| a. DAS (global PLB) | | | | | | | | |
| 0.25 | 70 | 83 | 79.0 | 63.9 | 55 | 88 | 53.8 | 23.0 |
| 0.50 | 37 | 54 | 41.4 | 33.4 | 11 | 38 | 30.3 | 13.5 |
| 0.75 | 17 | 38 | 28.8 | 22.9 | 3 | 33 | 21 | 9.7 |
| 1.0 | 27 | 21 | 21.9 | 17.6 | 9 | 18 | 16.4 | 7.6 |
| 2.0 | 3 | 12 | 11.9 | 9.6 | 25 | 10 | 8.8 | 4.3 |
| 4.0 | 6 | 6 | 6.7 | 5.8 | 15 | 5 | 4.8 | 2.5 |
| 6.0 | 5 | 6 | 5.1 | 4.6 | 0 | 4 | 3.3 | 1.8 |
| 8.0 | 9 | 4 | 4.4 | 4.0 | 3 | 3 | 2.6 | 1.4 |
| b. Geog. levels (global PLB = 1.0) | | | | | | | | |
| Nation | 11 | 134 | 96.8 | 78.7 | 5 | 127 | 66.9 | 27.3 |
| State | 268 | 8 | 98.3 | 78.6 | 87 | 6 | 63.6 | 27.4 |
| County | 128 | 121 | 8.5 | 78.2 | 183 | 110 | 6.8 | 26.8 |
| Enumeration district | 17 | 102 | 99.5 | 8.2 | 109 | 103 | 68.2 | 4.3 |
| c. Queries (global PLB = 1.0) | | | | | | | | |
| Detailed | 67 | 41 | 23.7 | 15.1 | 37 | 40 | 16.4 | 6.5 |
| Voting age–Hispanic–Race | 19 | | 18.7 | 15.1 | 14 | 20 | 13.6 | 6.6 |
| Household–Group quarters | 910 | 274 | 118.5 | 43.9 | 909 | 263 | 98.0 | 35.2 |
| Mean population | 118,803,172 | 2,329,474 | 38,225 | 881 | 12,663,581 | 248,306 | 4,075 | 94 |
| Median population | 118,803,172 | 1,529,607 | 15,804 | 753 | 12,663,581 | 64,442 | 263 | 0 |

Note. See note to Table 4.

5.3. Assessing Measures of Segregation

We now examine how differential privacy affects the calculation of segregation measures, which as noted above are nonlinear functions of population counts, often for small groups or small areas (or both). To illustrate differences between differentially private data sets and the IPUMS full-count file, we employ a combination of maps, scatter plots, and tables, focusing most on dissimilarity (D). As it turns out, the results for the other measures (entropy and isolation) are broadly similar to those for D (see the Appendix for a detailed discussion of the entropy and isolation results). 

Index of Dissimilarity.

Figure 4a shows several maps of county-level dissimilarity indices. The lower right panel shows baseline dissimilarity indices calculated from the IPUMS full-count file. Counties in darker red have greater dissimilarity, implying greater levels of segregation. Aside from some sparsely populated counties in the Plains and Mountain states, which have no recorded African American population, counties with lower dissimilarity tend to occur in the South. This reflects that much of the African American population lived in this region in 1940, and relative to other regions, was more evenly dispersed across enumeration districts within counties.


Figure 4. Differences in Dissimilarity Index between IPUMS full-count and Disclosure Avoidance System (DAS) data files, by county, 1940 Census. Figure a shows all counties in the United States, and figure b shows counties with a population (in the IPUMS full-count file) over 18,000 (according to IPUMS, this is roughly the median county population in 1940). The figures display the difference in county-level dissimilarity indices, for Whites and African Americans, between the IPUMS 1940 Census full-count file and three differentially private files produced by the Census Bureau—those with ε values of 0.25 (upper left), 1.0 (upper right), and 8.0 (lower left). Counties in blue indicate that the differential privacy dissimilarity index is lower than that from the IPUMS file (negative bias); counties in red indicate the opposite (positive bias). The map in the lower right shows county-level dissimilarity indices calculated from the IPUMS full-count file (not differenced) for context. Darker red areas have higher dissimilarity index values.

The other three panels reflect differences in dissimilarity indices between the IPUMS full-count file and each of the three Census Bureau–provided differential privacy files, in which ε = 0.25 (upper left), ε = 1.0 (upper right), and ε = 8.0 (lower left). As with the earlier figures, counties in blue indicate the IPUMS full-count measure is higher than that calculated from the differential privacy file, and counties in red indicate the opposite. Expressed differently, the darker blue a county is, the greater the underestimate of dissimilarity in the differential privacy file relative to the IPUMS full-count, and the darker red, the greater the overestimate.

While differences in the South are minor across values of ε, the story in other regions is quite different. When ε = 0.25, the majority of counties in other regions show substantial differences. Both under- and overestimates of dissimilarity are common, although the former are more frequent and more severe. As ε increases, differences shrink but are still apparent in a large number of counties even when ε = 8; moreover, the overestimates of dissimilarity seem to diminish more than the underestimates, leading to an average downward bias in dissimilarity from the differentially private files despite a generous PLB (high ε).

Of course, as we have seen before, the induced error of differential privacy is greater for less populous places, so Figure 4b restricts the maps to counties with above-median population. Although these larger counties show less bias—especially as ε increases—there is still a fair amount present, particularly in the Midwest. Interestingly, the pronounced degree of underestimated dissimilarity is no longer present, and if anything, overestimates are now slightly more common. This suggests downward bias is more concentrated in less populated counties.

Furthermore, because African Americans were heavily concentrated in the South in 1940, Figure 5 zooms in on former slaveholding states, using a finer scale of differences than in Figures 4a and 4b. For this region, differences shrink more dramatically as ε increases, and the largest differences that remain when ε = 8 are in counties where the IPUMS full-count dissimilarity index was close to 0 or 1, indicating how sensitive the measure can be to extreme cases.

Figure 5. Differences in Dissimilarity Index between IPUMS full-count and Disclosure Avoidance System (DAS) data files, by county in former slave-owning states, 1940 Census. The figure displays the difference in county-level dissimilarity indices, for Whites and African Americans, between the IPUMS 1940 Census full-count file and three differentially private files produced by the Census Bureau—those with ε values of 0.25 (upper left), 1.0 (upper right), and 8.0 (lower left). Unlike Figure 4, only counties from former slave-owning states are included. Otherwise see note to Figure 4, but note the scale differs.

Table 6 expands on the differences captured in the maps of Figures 4a, 4b, and 5 by calculating D using Whites and non-Whites (not just African Americans), which allows us to survey a slightly broader set of counties, particularly counties in the Plains and Mountain West. The left-most column lists the number of counties with valid values of D for both the differentially private and IPUMS data.17 The middle column lists the number of counties whose D value computed from the differentially private data is less than their D value from the IPUMS data. The right-most column shows the percentage of counties with D_diff.private < D_IPUMS. Because lower values of D indicate less segregation, in counties with D_diff.private < D_IPUMS the differential privacy algorithm generates data showing less segregation than was observed in the full-count data.

Table 6. Differences in the index of dissimilarity, D, between Whites and non-Whites, derived from IPUMS full-count and differentially private data files, by geography and privacy-loss budget (PLB) allocation.

| | Counties with valid values of D | Counties with D_diff.private < D_IPUMS | % of counties with D_diff.private < D_IPUMS |
| --- | --- | --- | --- |
| a. DAS (global PLB) | | | |
| 0.25 | 2,742 | 1,098 | 40.0 |
| 0.50 | 2,777 | 1,422 | 51.2 |
| 0.75 | 2,799 | 1,563 | 55.8 |
| 1.0 | 2,827 | 1,738 | 61.5 |
| 2.0 | 2,853 | 1,926 | 67.5 |
| 4.0 | 2,857 | 2,096 | 73.4 |
| 6.0 | 2,853 | 2,121 | 74.3 |
| 8.0 | 2,862 | 2,094 | 73.2 |
| b. Geog. levels (global PLB = 1.0) | | | |
| Nation | 2,720 | 950 | 34.9 |
| State | 2,683 | 974 | 36.3 |
| County | 2,856 | 1,099 | 38.5 |
| Enumeration district | 2,690 | 1,725 | 64.1 |
| c. Queries (global PLB = 1.0) | | | |
| Detailed | 2,840 | 1,822 | 64.2 |
| Voting age–Hispanic–Race | 2,823 | 1,802 | 63.8 |
| Household–Group quarters | 2,764 | 865 | 31.3 |

Note. See note to Table 4.

For the Census Bureau’s differentially private data set (Table 6a), the percentage of counties with D_diff.private < D_IPUMS increases as the global privacy-loss budget increases from 0.25 to 4.0 and then plateaus at approximately 74%. This bias is also apparent when we plot the D values from the differentially private data against those from the IPUMS full-count data set (Figure 6). Each dot represents a county, and as ε increases, we observe more dots to the right of the identity line.

Figure 6. Index of dissimilarity, D, for different global privacy-loss budgets. Panel labels are the global privacy-loss budgets used to generate the differentially private data. The teal dots are counties where the minority group (White or non-White) comprises less than 10% of the county’s total population. The orange dots are counties where the minority group comprises more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

The downward bias in D as ε increases, also apparent in Figures 4 and 5, is likely due to two factors. First, the magnitude of the non-White counts is small, and, second, the noise injection is spreading out the non-Whites among enumeration districts in the county. A higher ε value produces more accurate counts for enumeration districts and counties, but the randomness of the noise injection may allocate a few individual non-Whites into enumeration districts where there were none.18 This spreading process will reduce the value of D, especially in counties with smaller populations and in those with few non-Whites.
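A toy simulation makes the spreading effect concrete. The sketch below uses stylized counts and simple integer noise of our own choosing, not the DAS's actual noise distribution or postprocessing.

```python
import numpy as np

rng = np.random.default_rng(3)

def dissimilarity(w, b):
    """Index of dissimilarity (Equation 2); D = 0 if either group is absent."""
    w, b = np.asarray(w, float), np.asarray(b, float)
    if w.sum() == 0 or b.sum() == 0:
        return 0.0
    return 0.5 * np.abs(w / w.sum() - b / b.sum()).sum()

# A stylized, fully segregated county: 10 EDs, all non-Whites in one ED.
white = np.full(10, 500)
white[0] = 0
black = np.zeros(10)
black[0] = 40
print(dissimilarity(white, black))  # 1.0

# Small integer noise scatters a few non-Whites into all-White EDs,
# pulling D well below 1.0 even though county totals stay roughly correct.
sims = [
    dissimilarity(white, np.maximum(black + rng.integers(-2, 3, size=10), 0))
    for _ in range(1_000)
]
print(np.mean(sims))  # noticeably below 1.0
```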

For the differentially private data set where we allocate a large fraction of the PLB to one geographic level (Table 6b), the percentage of counties with D_diff.private < D_IPUMS ranges from 34% to 38% for the nation, state, and county allocations. The percentage jumps to 64% when we allocate 0.85 of the PLB to the enumeration district level. Figure 7 plots the D values from these comparisons. We observe more counties to the left of the identity line for the nation, state, and county allocations and more counties to the right for the enumeration district allocation.

Figure 7. Index of dissimilarity, D, for varying fractional allocations to geographic levels. Panel labels are the geographic levels receiving a 0.85 allocation. The teal dots are counties where the minority group (White or non-White) comprises less than 10% of the county’s total population. The orange dots are counties where the minority group comprises more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

The variation in D for counties with a minority population comprising greater than 10% of their total population (indicated by the orange dots in Figure 7) is higher for the nation, state, and county allocations than it is for the enumeration district allocation. The bias for the nation, state, and county allocations differs by the size of the minority population. When the non-White population is greater than 10% of the county’s total population, D is higher in the differentially private data than it is in the IPUMS full-count data. When the minority population is less than 10% of the county’s total population, D is lower in the differentially private data than it is in the IPUMS full-count data.

For the differentially private data set where we allocate a large fraction of the PLB to a single query (Table 6c), the percentage of counties with D_diff.private < D_IPUMS is 64% for the detailed and voting age–Hispanic–race allocations. The percentage declines to 31% when the household–group quarters query receives 90% of the PLB. When we plot those values of D, more counties are to the right of the identity line for the detailed and voting age–Hispanic–race queries, and more counties are to the left for the household–group quarters query (Figure 8).

Figure 8. Index of dissimilarity, D, for varying fractional allocations to queries. Panel labels are the queries receiving a 0.90 allocation. The teal dots are counties where the minority group (White or non-White) comprises less than 10% of the county’s total population. The orange dots are counties where the minority group comprises more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

We also observe more variation in D when the household–group quarters query receives the bulk of the PLB (Figure 8). Cells that include race will have more noise injected into them, increasing the variation of measures like D that use those cells. When the detailed or voting age–Hispanic–race queries receive the bulk of the PLB, race counts will be more accurate, thereby reducing variation in D, particularly for counties with a minority population comprising more than 10% of their total population (orange dots in Figure 8).

Results for multigroup entropy (H) and isolation (B) are similar to those for the index of dissimilarity. The Appendix includes a discussion of those results along with tables and figures.

6. Discussion

Using a differentially private algorithm to protect respondent privacy has implications for the accuracy and utility of published statistics. The accuracy of the data is influenced by the parameter values, including the global privacy-loss budget (ε) and its allocation to geographic levels and queries, and the postprocessing routine that converts noisy measurements into microdata. As the PLB increases, the overall accuracy for geographic levels improves given a particular allocation of the PLB to geographic levels and queries. If we fix the PLB and then modify its allocation to geographic levels or queries, we may improve the accuracy for a geographic level or query by providing it a larger allocation. Of course, with a fixed PLB, an accuracy improvement for a particular level or query results in decreased accuracy for other levels or queries.

The variation in accuracy based on parameter values highlights their fundamental importance. Since these values are set by humans, it is important to consider the decision-making process behind their selection. What criteria are used to set the parameters? Who helps determine those criteria? Transparency about the process will build trust with data users and help them understand the trade-offs decision makers faced. For the two differentially private data sets created by IPUMS, the team chose an arbitrary PLB (ε = 1) and fractional allocations that privileged a particular geographic level or query. By privileging a single level or query, the team could determine how accurate the privileged entity would be and how inaccurate the nonprivileged entities would be.

While overall accuracy improves as the PLB increases, we still observe large relative errors for less populous geographies and small subpopulations. As depicted on the maps in Figures 1a, 2a, and 3a, the largest relative errors for total population occur in the sparsely populated Western states, and the relative errors for African Americans are larger than those for Whites. Given that the decennial census provides the most reliable counts for small places and small subpopulations, publishing data with large relative errors will hinder scientists' ability to study these groups and places and policymakers' ability to develop effective, useful plans for them.

Residential segregation is often computed from decennial census data. Commonly used indices (dissimilarity, entropy, isolation) are sensitive to the differentially private noise injection and postprocessing of the Census DAS algorithm. While this process lowers estimates of segregation on average, it raises them in some cases. More generally, statistics that are sensitive to measurement error may become entirely impractical once the DAS has been applied, especially in rural places or when studying low-population groups. We will thus need new measures that account for differentially private noise injection. If these new measures are not compatible with existing ones, our ability to study change over time and determine whether social phenomena like residential segregation are improving or worsening may be hindered.

The Census Bureau has not yet finalized all aspects of the new system, but we nonetheless expect that our core concerns will remain. Since segregation indices rely on noise-injected counts of subpopulations from two separate levels of geography, and often from sparsely populated areas within each level, bias is likely to persist even under further revisions to the TopDown Algorithm, because the privacy-loss budget does not account for these kinds of functions. More broadly, because the set of functions of statistics plausibly of interest to researchers or stakeholders is unbounded, while the PLB is decidedly finite, the issue cannot be resolved mathematically. Some complex statistics presumed by researchers to be meaningful under previous disclosure avoidance systems (even if mistakenly in some cases) may not be useful under the TopDown system whenever ε is finite and some privacy protection is imposed. The size of this set has yet to be quantified.

7. Conclusion

The decennial census of population and housing provides critical data about the demographic and housing characteristics of millions of geographic units in the United States. These data are used by innumerable stakeholders to understand their communities, document disparities, study the impact of policy changes, and plan for the future. The Census Bureau's adoption of a differentially private algorithm, which injects noise into published statistics to protect respondent confidentiality, will generate less accurate decennial data. Using the original full-count 1940 Decennial Census data and differentially private versions of the original data, we find that accuracy improves as the global privacy-loss budget increases, although relative accuracy for less populous geographies or smaller population subgroups may still be low. We also find that the accuracy for geographic levels and population subgroups is affected by the allocation of the global PLB. Finally, we observe variation in commonly used segregation measures depending on the magnitude of the PLB and its allocation to geographic levels and queries. Changing which geographic level or query receives the bulk of the PLB changes where we observe more or less segregation than in the original data.

An important caveat to these findings is that we do not directly consider how postprocessing, separate from the implementation of differential privacy, may further bias statistics generated by the census DAS. Postprocessing ensures nonnegative population counts and consistent aggregations across subgroups, which are desirable properties, but at the cost of introducing further, harder-to-quantify bias into the population moments. Recent independent analyses, using 2010 Census data not yet available when we conducted the exercises in this article, found that postprocessing is responsible for the majority of observed discrepancies between the originally released data and differentially private versions (Hotz & Salvo, 2020). The exact postprocessing algorithms are still being revised by the Census Bureau as of this writing, so it is unknown whether the specific issues we and others have identified will be resolved by the release of the 2020 Census data sets, particularly the Demographic and Housing Characteristics file. More generally, the reliability of complex statistics that are functions of queries scheduled for release is understudied, even though many researchers and other stakeholders may take this reliability for granted.

We strongly recommend that the Census Bureau make both the PLB and the postprocessing methods as transparent as confidentiality concerns allow, particularly to ensure that accurate information can be obtained about small geographies and populations. As more information is released, we anticipate that future research will examine how to construct confidence intervals that account for the addition of postprocessing to differentially private noise injection. We nonetheless believe that even further transparency and revisions to the differential privacy algorithms will leave our core findings unchanged, as the Census Bureau's differential privacy designs do not account for nonlinear functions of published statistics, such as segregation indices. Similarly, we will need more research, and quite possibly considerable caution, when comparing population statistics across censuses, whether both use the planned DAS (the 2020 and 2030 censuses, for example) or they use different disclosure avoidance methods (the 2010 and 2020 censuses, for example).

Under the planned DAS, the variable accuracy of population counts under different global PLBs and PLB allocations, together with the sensitivity of commonly used social science statistics to differentially private noise injection and postprocessing, leads us to the following conclusions. It is critical that data producers like the Census Bureau seek input from potential users in setting DAS parameters and in determining which statistics receive greater shares of the PLB. Users must know how the choice of parameters affects published data, and it is incumbent upon producers to provide that guidance. Indeed, data producers should actively work with users to develop documentation and tools to evaluate and understand the reliability and validity of statistics derived from publicly released data. We expect that external researchers will nonetheless have to develop new techniques to accommodate this new reality (Goroff & Groshen, 2020). Finally, social scientists, statisticians, and computer scientists must collaborate to create new or improved population measures that more robustly accommodate the new DAS algorithm (Eltinge, 2020).


Acknowledgments

 We are grateful for feedback from Erica Groshen, three anonymous reviewers, and participants at the 2019 Harvard Data Science Initiative conference.

Contributions

BA, BH, and DVR were responsible for conceptualization, project administration, methodology, writing – original draft, and writing – review & editing. DVR, SMR, and SY were responsible for data curation, formal analysis, and software. TAK, SR, and JS assisted with conceptualization.

Disclosure Statement

Support for the IPUMS team was provided by the Minnesota Population Center, which is funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (P2C HD041023). The Upjohn Team has nothing additional to disclose.


References

Abowd, J. (2018a). Protecting the confidentiality of America’s statistics: Adopting modern disclosure avoidance methods at the Census Bureau. U.S. Census Bureau. https://www.census.gov/newsroom/blogs/research-matters/2018/08/protecting_the_confi.html

Abowd, J. (2018b). Protecting the confidentiality of America’s statistics: Ensuring confidentiality and fitness-for-use. U.S. Census Bureau. https://www.census.gov/newsroom/blogs/research-matters/2018/08/protecting_the_confi0.html

Abowd, J., Ashmead, R., Garfinkel, S., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., & Zhuravlev, P. (2019). Census TopDown Algorithm: Differentially private data, incremental schemas, and consistency with public knowledge. U.S. Census Bureau. https://github.com/uscensusbureau/census2020-das-2010ddp/blob/master/doc/20191020_1843_Consistency_for_Large_Scale_Differentially_Private_Histograms.pdf

Abowd, J. M. (2016a). How will statistical agencies operate when all data are private? Journal of Privacy and Confidentiality, 7(3). https://doi.org/10.29012/jpc.v7i3.404

Abowd, J. M. (2016b). Why statistical agencies need to take privacy-loss budgets seriously, and what it means when they do [Presentation]. The 13th Biennial Federal Committee on Statistical Methodology (FCSM) Policy Conference, December 6–7, Washington, DC. https://digitalcommons.ilr.cornell.edu/ldi/32/

Bailie, J., & Chien, C.-H. (2019). ABS perturbation methodology through the lens of differential privacy. UNECE. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S2_ABS_Bailie_D.pdf

Bambauer, J., Muralidhar, K., & Sarathy, R. (2014). Fool’s gold: An illustrated critique of differential privacy. Vanderbilt Journal of Entertainment & Technology Law, 16(4), 701–755. https://scholarship.law.vanderbilt.edu/jetlaw/vol16/iss4/1/

Brummet, Q., Mulrow, E., & Wolter, K. (2020). The effect of differentially private noise injection on sampling efficiency and funding allocations: Evidence from the 1940 Census [Working Paper]. NORC at the University of Chicago.

Dinur, I., & Nissim, K. (2003). Revealing information while preserving privacy. In F. Neven, C. Beeri, & T. Milo (Eds.), Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 202–210). Association for Computing Machinery. https://doi.org/10.1145/773153.773173

Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In S. Halevi & T. Rabin (Eds.), Theory of cryptography (pp. 265–284). Springer. https://doi.org/10.1007/11681878_14

Eltinge, J. L. (2020). Disclosure limitations in the context of statistical agency operations: Data quality and related constraints [Working Paper]. U.S. Census Bureau.

Frey, W., & Myers, D. (2005). Racial segregation in US metropolitan areas and cities, 1990-2000: Patterns, trends, and explanations. University of Michigan Population Studies Center. https://www.psc.isr.umich.edu/pubs/abs/3199

Garfinkel, S. L., Abowd, J. M., & Powazek, S. (2018). Issues encountered deploying differential privacy. In D. Lie & M. Mannan (Eds.), Proceedings of the 2018 Workshop on Privacy in the Electronic Society (pp. 133–137). Association for Computing Machinery. https://doi.org/10.1145/3267323.3268949

Gong, R. (2020). Transparent privacy is principled privacy (2006.08522). arXiv. https://doi.org/10.48550/arXiv.2006.08522

Gordon, T. (2019). The census is about nearly $1 trillion in federal spending, not just elections. Tax Policy Center. https://www.taxpolicycenter.org/taxvox/census-about-nearly-1-trillion-federal-spending-not-just-elections

Goroff, D. L., & Groshen, E. L. (2020). Disclosure avoidance and the 2020 Census: What do researchers need to know? [Working Paper]. Alfred P. Sloan Foundation and Cornell University.

Hotchkiss, M., & Phelan, J. (2017). Uses of Census Bureau data in federal funds distribution. U.S. Census Bureau. https://www.census.gov/library/working-papers/2017/decennial/census-data-federal-funds.html

Hotz, V. J., & Salvo, J. (2020). Assessing the use of differential privacy for the 2020 Census: Summary of what we learned from the CNSTAT Workshop. American Statistical Association. https://www.amstat.org/ASA/News/Co-Chairs-Share-Insights-Gleaned-from-2020-Census-Data-Products-Workshop.aspx

Iceland, J., Weinberg, D. H., & Steinmetz, E. (2002). Racial and ethnic residential segregation in the United States: 1980-2000 (CENSR-3; Census 2000 Special Reports, p. 151). U.S. Census Bureau. https://www.census.gov/prod/2002pubs/censr-3.pdf

Leclerc, P. (2019a). Reconstruction of person level data from data presented in multiple tables. Challenges and new approaches for protecting privacy in federal statistical programs. National Academies. http://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_193509.pdf

Leclerc, P. (2019b). Guide to the census 2018 end-to-end test Disclosure Avoidance Algorithm and implementation. U.S. Census Bureau. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0938_2018_E2E_Test_Algorithm_Description.pdf

Massey, D. S., & Denton, N. A. (1988). The dimensions of residential segregation. Social Forces, 67(2), 281–315. https://doi.org/10.2307/2579183

McKenna, L. (2018). Disclosure avoidance techniques used for the 1970 through 2010 Decennial Censuses of Population and Housing (No. 18–47; p. 39). U.S. Census Bureau. https://www2.census.gov/ces/wp/2018/CES-WP-18-47.pdf

Napierala, J., & Denton, N. (2017). Measuring residential segregation with the ACS: How the margin of error affects the dissimilarity index. Demography, 54(1), 285–309. https://doi.org/10.1007/s13524-016-0545-z

National Research Council. (1995). Modernizing the U.S. Census. National Academies Press. https://doi.org/10.17226/4805

R Core Team. (2018). R: A language and environment for statistical computing (3.4.4) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Reamer, A. (2019). Brief 7: Comprehensive accounting of census-guided federal spending (FY2017) (p. 6). George Washington Institute of Public Policy. https://gwipp.gwu.edu/sites/g/files/zaxdzs2181/f/downloads/Counting%20for%20Dollars%202020%20Brief%207A%20-%20Comprehensive%20Accounting.pdf

Reiter, J. P. (2019). Differential privacy and federal data releases. Annual Review of Statistics and Its Application, 6(1), 85–101. https://doi.org/10.1146/annurev-statistics-030718-105142

RStudio Team. (2017). RStudio: Integrated development for R (1.1.383) [Computer software]. RStudio. http://www.rstudio.com

Ruggles, S., Fitch, C., Magnuson, D., & Schroeder, J. (2019). Differential privacy and census data: Implications for social and economic research. AEA Papers and Proceedings, 109, 403–408. https://doi.org/10.1257/pandp.20191107

Ruggles, S., Flood, S., Goeken, R., Grover, J., Meyer, E., Pacas, J., & Sobek, M. (2018). IPUMS USA: Version 8.0 extract of 1940 Census for U.S. Census Bureau Disclosure Avoidance Research [dataset]. IPUMS. https://doi.org/10.18128/D010.V8.0.EXT1940USCB

Sexton, W. (2019, December 12). Day 2 follow-up. Workshop on 2020 Census Data Products: Data Needs and Privacy Considerations. National Academies. https://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_197520.pdf

Spicer, K. (2020). Statistical Disclosure Control (SDC) for 2021 UK Census (No. EAP125; Methodological Assurance Review). UK Statistics Authority. https://uksa.statisticsauthority.gov.uk/wp-content/uploads/2020/07/EAP125-Statistical-Disclosure-Control-SDC-for-2021-UK-Census.docx

StataCorp. (2015). Stata statistical software: Release 14. StataCorp LP.

StataCorp. (2019). Stata statistical software: Release 16. StataCorp LP.

2020 Census DAS Development Team. (2019). Disclosure Avoidance System for the 2020 Census, End-to-End release: Uscensusbureau/census2020-das-e2e [Python]. U.S. Census Bureau. https://github.com/uscensusbureau/census2020-das-e2e

U.S. Census Bureau. (2018). Statistical safeguards. https://www.census.gov/about/policies/privacy/statistical_safeguards.html

Wood, A., Altman, M., Bembenek, A., Bun, M., Gaboardi, M., Honaker, J., Nissim, K., O’Brien, D. R., Steinke, T., & Vadhan, S. (2018). Differential privacy: A primer for a non-technical audience. Vanderbilt Journal of Entertainment & Technology Law, 21(1), 209–276. https://scholarship.law.vanderbilt.edu/jetlaw/vol21/iss1/4/


Appendix

Dissimilarity Indices

Table A1 summarizes differences in values of the dissimilarity index (D) by the global privacy-loss budget ε, as well as by county population and region. As with Figures 4a, 4b, and 5, D is calculated between Whites and African Americans. D is quite sensitive to measurement error, and reading down the rows shows that the relationship is not monotonic in ε. The problem is generally more severe in areas that are smaller or more rural, as evidenced by the differences by county population.

Table A1. Distribution of county-level dissimilarity indices between White and African American populations calculated from enumeration districts, for IPUMS full-count and differentially private (DP) data files.

| | All Counties, M (SD) | Below Median Population, M (SD) | Above Median Population, M (SD) | Former Slave States, M (SD) |
| --- | --- | --- | --- | --- |
| IPUMS | 0.5300 (0.2989) | 0.5234 (0.3106) | 0.5647 (0.2710) | 0.4496 (0.2070) |
| DP: ε = 0.25 | 0.5116 (0.3328) | 0.4905 (0.3414) | 0.5304 (0.3140) | 0.4721 (0.2112) |
| DP: ε = 1.0 | 0.5351 (0.2999) | 0.5357 (0.3130) | 0.5476 (0.2758) | 0.4528 (0.1952) |
| DP: ε = 8.0 | 0.5523 (0.2766) | 0.5623 (0.2822) | 0.5690 (0.2465) | 0.4570 (0.1962) |

Note. Means and standard deviations represent the simple average across counties (not weighted by population). Statistics for the last three rows are further based on averages across four different implementations or runs (with a given ε) of the Disclosure Avoidance System (DAS) differential privacy algorithm. The first data column includes all counties (for which D can be calculated), the second restricts to counties below the median population (approximately 18,000), the third to counties above the median population, and the fourth to counties in the former slaveholding states (AL, AR, DC, FL, GA, KY, MD, MO, MS, NC, SC, TN, TX, and VA).
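To make the computation behind Table A1 concrete, the following minimal Python sketch computes the standard two-group index of dissimilarity from enumeration-district counts, following the half-sum-of-absolute-share-differences formula of Massey and Denton (1988). The district counts and variable names here are purely hypothetical.

```python
import numpy as np

def dissimilarity(group_a, group_b):
    """Two-group index of dissimilarity, D, for one county.

    `group_a` and `group_b` are per-enumeration-district population
    counts (e.g., White and African American). D is half the sum of
    absolute differences between each district's share of the county
    totals; it is undefined when either group is absent.
    """
    total_a, total_b = group_a.sum(), group_b.sum()
    if total_a == 0 or total_b == 0:
        return None
    return 0.5 * np.abs(group_a / total_a - group_b / total_b).sum()

# Hypothetical enumeration-district counts for one county.
white = np.array([400, 350, 250])
african_american = np.array([5, 10, 85])
print(dissimilarity(white, african_american))  # 0.60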

Multigroup Entropy

Multigroup entropy (H) is a measure of evenness that considers the simultaneous distribution of three or more mutually exclusive population subgroups among a set of smaller geographies nested within a larger geography (Iceland, 2004; Iceland et al., 2002). Again, the smaller geographies are enumeration districts, which nest within a county. We partition our data into four subgroups: White, African American, American Indian/Alaska Native, and Asian.19 H is the weighted average deviation of an enumeration district’s entropy from the county’s entropy.

Equation A1 provides the entropy index for a given geographic unit:

$$e_i = \sum_{j=1}^{k} p_{ij} \ln\left(\frac{1}{p_{ij}}\right) \qquad (A1)$$

where k is the number of population subgroups and p_ij is the jth subgroup's proportion of unit i's total population. We compute e_i for every enumeration district within a county and for the county as a whole.

We then use equation A2 to compute the multigroup entropy, H, value for a county:

$$H = \sum_{i=1}^{n} \left[ \frac{t_i \left( e - e_i \right)}{e\,t} \right] \qquad (A2)$$

where n is the number of enumeration districts, t_i is the total population of enumeration district i, e is the entropy index for the county, e_i is the entropy index for enumeration district i, and t is the total population of the county.

Values of H may vary from 0 to 1. Counties with relatively even distributions of subgroups across enumeration districts will have lower values of H, indicating a lower level of segregation. Counties with uneven distributions of subgroups across enumeration districts (e.g., enumeration districts are composed entirely of persons from one population subgroup) will have higher values of H, indicating a higher level of segregation.
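The minimal Python sketch below translates Equations A1 and A2 directly; the district-by-subgroup count matrix is hypothetical, and the helper names are ours. Zero-count subgroups are dropped from the entropy sum, since p ln(1/p) tends to 0 as p tends to 0.

```python
import numpy as np

def entropy(counts):
    """Entropy index e for one geographic unit (Equation A1)."""
    p = counts[counts > 0] / counts.sum()  # zero-count groups contribute 0
    return np.sum(p * np.log(1 / p))

def multigroup_entropy(district_counts):
    """Multigroup entropy H for a county (Equation A2).

    `district_counts` is an (n districts x k subgroups) array of
    population counts. Assumes the county entropy e is positive.
    """
    county_counts = district_counts.sum(axis=0)
    e = entropy(county_counts)              # county entropy
    t = county_counts.sum()                 # county population
    t_i = district_counts.sum(axis=1)       # district populations
    e_i = np.array([entropy(row) for row in district_counts])
    return np.sum(t_i * (e - e_i)) / (e * t)

# Hypothetical counts for one county: 3 enumeration districts x 4
# subgroups (White, African American, American Indian/Alaska Native, Asian).
districts = np.array([
    [900,  50,  25,  25],
    [100, 800,  50,  50],
    [500, 400,  50,  50],
])
print(multigroup_entropy(districts))  # higher values = more segregation
```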

Multigroup entropy is often recommended as the most appropriate segregation index because of its decomposability and its sensitivity to the reconfiguration of population among geographic units (S. Reardon, 2017; S. F. Reardon & Firebaugh, 2002; S. F. Reardon & O’Sullivan, 2004). Like the index of dissimilarity, multigroup entropy is also sensitive to measurement error, albeit less so than D. If a single person of a particular subgroup is added to a district that previously included zero persons of that subgroup, D will change from 0 to nearly 1. H will also increase from 0, but the magnitude of the increase will be smaller than D’s increase. This difference is the result of H taking the size of the measurement error into account via the population-weighted sum (Equation A2).
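The toy example below makes this difference concrete under purely hypothetical counts: a county of ten all-White enumeration districts of 500 persons each receives a single misassigned non-White person. Using the standard two-group formulas, D becomes 0.9, while H rises only to about 0.24.

```python
import numpy as np

def dissimilarity(a, b):
    """Two-group D from per-district counts (undefined if a sum is zero)."""
    return 0.5 * np.abs(a / a.sum() - b / b.sum()).sum()

def two_group_entropy_h(a, b):
    """Two-group H built from Equations A1 and A2.

    Assumes the county-level entropy is positive.
    """
    def e(counts):
        p = counts[counts > 0] / counts.sum()
        return np.sum(p * np.log(1 / p))
    t_i = a + b                                 # district populations
    e_i = np.array([e(np.array([ai, bi])) for ai, bi in zip(a, b)])
    e_county = e(np.array([a.sum(), b.sum()]))  # county entropy
    return np.sum(t_i * (e_county - e_i)) / (e_county * t_i.sum())

# Ten all-White districts of 500 persons each; one misassigned
# non-White person lands in the first district.
white = np.full(10, 500)
nonwhite = np.zeros(10)
nonwhite[0] = 1

print(dissimilarity(white, nonwhite))        # 0.90: D leaps toward 1
print(two_group_entropy_h(white, nonwhite))  # ~0.24: H moves far less
```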

Table A2 summarizes differences in values of the multigroup entropy index (H) by the global privacy-loss budget ε\varepsilon, as well as by county population and region. Unlike some other measures, H can be calculated for all 3,108 counties, and the first column shows the number of such counties for which H computed from the differentially private data is less than H computed from the full-count IPUMS data. The right-most column shows the corresponding percentage of counties with Hdiff. private < HIPUMS. Lower values of H indicate less segregation; thus, counties for which Hdiff. private < HIPUMS indicate the DAS algorithm generates data showing less segregation than was observed in the full-count data.

Table A2. Difference between multigroup entropy, H, derived from IPUMS full-count and differentially private data files, by geography and privacy-loss budget.

 

| | Counties With Hdiff. private < HIPUMS | Percent of Counties With Hdiff. private < HIPUMS |
| --- | --- | --- |
| a. DAS | | |
| 0.25 | 1,106 | 35.6 |
| 0.50 | 1,360 | 43.8 |
| 0.75 | 1,559 | 50.2 |
| 1.0 | 1,662 | 53.5 |
| 2.0 | 1,885 | 60.6 |
| 4.0 | 2,083 | 67.0 |
| 6.0 | 2,179 | 70.1 |
| 8.0 | 2,168 | 69.8 |
| b. Geographic Levels | | |
| Nation | 972 | 31.3 |
| State | 1,014 | 32.6 |
| County | 973 | 31.3 |
| Enumeration district | 1,977 | 63.6 |
| c. Queries | | |
| Detailed | 1,725 | 55.5 |
| Voting age–Hispanic–Race | 1,774 | 57.1 |
| Household–Group quarters | 833 | 26.8 |

Note. Panel a uses differentially private data files released by the Census Bureau, and rows reflect the global privacy-loss budget (PLB). Panels b and c use differentially private data files created by IPUMS using the Disclosure Avoidance System (DAS) procedure, with global PLB of 1.0 and allocations as shown in Table 2. H can be computed for all 3,108 counties in the 1940 data; thus, the denominator for the percentage column is 3,108 for all rows.

The results in Table A2 mirror those found in Table 6. For the Census Bureau's differentially private data set (Table A2a), the percentage of counties with Hdiff. private < HIPUMS increases with ε until plateauing for ε between 4.0 and 8.0. For the differentially private data set where we allocate a large fraction of the privacy-loss budget (PLB) to one geographic level (Table A2b), the percentage of counties with Hdiff. private < HIPUMS is roughly one-third for the nation, state, and county allocations, but jumps to 64% for the enumeration district allocation. Finally, for the differentially private data set where we allocate a large fraction of the PLB to a single query (Table A2c), the percentage of counties with Hdiff. private < HIPUMS is between 55% and 57% for the detailed and voting age–Hispanic–race allocations, and drops to 27% for the household–group quarters allocation.

Figure A1 plots, for various values of ε, H for the differentially private data set against H for the IPUMS full-count data. For lower values of ε, we observe more counties to the left of the identity line, indicating a positive bias in the differentially private data. As ε increases, we observe the point cloud shifting down and to the right, indicating a negative bias. Note that the bias is particularly noticeable for counties with a minority population share of less than 10%. Counties with a minority population share of at least 10% are more symmetrical around the identity line for all values of ε, suggesting little bias.

Figure A1. Multigroup entropy, H, for different global privacy-loss budgets. Panel labels are the global privacy-loss budgets used to generate the differentially private data. The teal dots are counties with the minority group (e.g., White or non-White) comprising less than 10% of the county’s total population. The orange dots are counties with the minority group (e.g., White or non-White) comprising more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

Figures A2 and A3 repeat the exercise but use differentially private data sets that vary the allocations by geographic level (Figure A2) and query (Figure A3). The patterns in these figures mirror those for D (Figures 7 and 8). When we vary the PLB allocation by geographic level, we observe a positive bias in the differentially private data for the nation, state, and county allocations, and a negative bias for the enumeration district allocation. Varying the allocation by query, we observe a noticeable positive bias for the household–group quarters allocation.

Figure A2. Multigroup entropy, H, for varying fractional allocations to geographic levels. Panel labels are the geographic levels receiving a 0.85 allocation. The teal dots are counties with the minority group (e.g., White or non-White) comprising less than 10% of the county’s total population. The orange dots are counties with the minority group (e.g., White or non-White) comprising more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

Figure A3. Multigroup entropy, H, for varying fractional allocations to queries. Panel labels are the queries receiving a 0.90 allocation. The teal dots are counties with the minority group (e.g., White or non-White) comprising less than 10% of the county’s total population. The orange dots are counties with the minority group (e.g., White or non-White) comprising more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

Isolation

Isolation (B) is a measure of exposure between two or more mutually exclusive population groups. It measures the probability that a member of the minority group will come into contact with another member of the minority group within a neighborhood (Iceland et al., 2002). Unlike evenness measures, such as multigroup entropy and the index of dissimilarity, exposure measures explicitly account for the relative sizes of the minority and majority groups. Thus, minority groups comprising a small proportion of a county will have low isolation values no matter how evenly they are spread throughout the county (Massey & Denton, 1988).

As with D and H, the isolation index B is computed using a set of smaller geographies nested within a larger geography. We use enumeration districts and counties for the smaller and larger geographies, respectively. B also requires mutually exclusive population subgroups, and we partition our data into White and non-White categories.

Equation A3 provides the isolation, B, value for a given county:

$$B = \sum_{i=1}^{n} \left( \frac{\mathrm{pop}_{inw}}{\mathrm{pop}_{nw}} \right) \left( \frac{\mathrm{pop}_{inw}}{\mathrm{pop}_{i}} \right) \qquad (A3)$$

where n is the number of enumeration districts in the county, pop_inw is the non-White population in enumeration district i, pop_nw is the non-White population of the county, and pop_i is the total population of enumeration district i. This formulation expresses isolation as the minority-weighted average of each enumeration district's minority proportion (Iceland et al., 2002).

Values of B computed from two mutually exclusive population subgroups vary from 0 to 1. The interpretation of B mirrors those for D and H. Higher values of B indicate higher levels of segregation, and lower values of B indicate lower levels of segregation.
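The following minimal Python sketch implements Equation A3 for one county. The per-district counts are hypothetical, and the function returns None when a county has no non-White residents, mirroring the counties for which B cannot be computed in Table A3.

```python
import numpy as np

def isolation(nonwhite, total):
    """Isolation index B for one county (Equation A3).

    `nonwhite` and `total` are per-enumeration-district counts of the
    non-White and total populations. B is undefined when the county
    has no non-White residents, so None is returned in that case.
    """
    county_nonwhite = nonwhite.sum()
    if county_nonwhite == 0:
        return None
    return np.sum((nonwhite / county_nonwhite) * (nonwhite / total))

# Hypothetical county of four enumeration districts.
nonwhite = np.array([90, 5, 5, 0])
total = np.array([100, 500, 500, 400])
print(isolation(nonwhite, total))  # ~0.81: non-White residents are isolated
```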

Like the index of dissimilarity, the isolation index, B, is sensitive to measurement error. In counties with zero non-Whites, measurement error that assigns a non-White person to the county will bias B upward. In highly segregated counties, measurement error that assigns a non-White person to an enumeration district with no non-White members can shift the index downward. The magnitude of the bias tends to be smaller than for D because B accounts for the overall size of the non-White subgroup in the county.

Table A3 summarizes differences in the values of isolation (B) by the global privacy-loss budget ε, as well as by county population and region. The left-most data column lists the number of counties with valid values of B for both the differentially private and IPUMS data sets.20 The middle column lists the number of such counties for which B computed from the differentially private data is less than B computed from the full-count IPUMS data. The right-most column shows the corresponding percentage of counties with Bdiff. private < BIPUMS. Lower values of B indicate less segregation; thus, when we observe counties with Bdiff. private < BIPUMS, the differential privacy algorithm generates data showing less segregation than was observed in the full-count data. However, note that minority groups comprising a small proportion of a county will have low absolute isolation values no matter how evenly they are spread throughout the county.

Table A3. Difference between the isolation, B, for Whites and non-Whites, derived from IPUMS full-count and differentially private data files, by geography and privacy-loss budget allocation.

 

| | Counties With Valid Values of B | Counties With Bdiff. private < BIPUMS | Percent of Counties With Bdiff. private < BIPUMS |
| --- | --- | --- | --- |
| a. DAS | | | |
| 0.25 | 2,742 | 1,049 | 38.3 |
| 0.50 | 2,777 | 1,293 | 46.6 |
| 0.75 | 2,799 | 1,422 | 50.8 |
| 1.0 | 2,827 | 1,530 | 54.1 |
| 2.0 | 2,853 | 1,720 | 60.3 |
| 4.0 | 2,857 | 1,834 | 64.2 |
| 6.0 | 2,853 | 1,908 | 66.9 |
| 8.0 | 2,862 | 1,853 | 64.7 |
| b. Geographic Levels | | | |
| Nation | 2,720 | 966 | 35.5 |
| State | 2,683 | 923 | 34.4 |
| County | 2,856 | 1,020 | 35.7 |
| Enumeration district | 2,691 | 1,570 | 58.3 |
| c. Queries | | | |
| Detailed | 2,840 | 1,609 | 56.7 |
| Voting age–Hispanic–Race | 2,823 | 1,604 | 56.8 |
| Household–Group quarters | 2,765 | 874 | 31.6 |

Note. Panel a uses differentially private data files released by the Census Bureau, and rows reflect the global privacy-loss budget (PLB). Panels b and c use differentially private data files created by IPUMS using the Disclosure Avoidance System (DAS) procedure, with global PLB of 1.0 and allocations as shown in Table 2. B can be computed for only a subset of the 3,108 counties in the 1940 data; thus, the denominator for the percentage column is shown in the first data column.

The results in Table A3 are similar to those found in Table 6 and Table A2. For the Census Bureau's differentially private data set (Table A3a), the percentage of counties with Bdiff. private < BIPUMS increases as ε increases until plateauing for ε between 4.0 and 8.0. For the differentially private data set where we allocate a large fraction of the PLB to one geographic level (Table A3b), the percentage of counties with Bdiff. private < BIPUMS ranges from 34% to 36% for the nation, state, and county allocations, and jumps to 58% for the enumeration district allocation. Finally, for the differentially private data set where we allocate a large fraction of the PLB to a single query (Table A3c), the percentage of counties with Bdiff. private < BIPUMS is approximately 57% for both the detailed and voting age–Hispanic–race allocations, but drops to 32% for the household–group quarters allocation.

Figure A4 plots, for various values of ε, B for the differentially private data set against B for the IPUMS full-count data. For lower values of ε, we observe more variation around the identity line, especially for counties with a minority population share of less than 10%. For counties with a minority population share of at least 10%, we also observe more variation at lower values of ε, but the point cloud converges on the identity line by ε of 2.0.

Figure A4. Isolation, B, for different global privacy-loss budgets. Panel labels are the global privacy-loss budgets used to generate the differentially private data. The teal dots are counties with the minority group (e.g., White or non-White) comprising less than 10% of the county’s total population. The orange dots are counties with the minority group (e.g., White or non-White) comprising more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

Figures A5 and A6 repeat the exercise but use differentially private data sets that vary the allocations by geographic level (Figure A5) and query (Figure A6). When we vary the privacy-loss budget allocation by geographic level, we observe a positive bias for the nation, state, and county allocations. The spread around the identity line, though, is smaller for B than for the other two indices. For the enumeration district allocation, we observe less variation around the identity line, except for low values of B. Varying the allocation by query, we observe a noticeable positive bias for the household–group quarters allocation.

Figure A5. Isolation, B, for varying fractional allocations to geographic levels. Panel labels are the geographic levels receiving a 0.85 allocation. The teal dots are counties with the minority group (e.g., White or non-White) comprising less than 10% of the county’s total population. The orange dots are counties with the minority group (e.g., White or non-White) comprising more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.

Figure A6. Isolation, B, for varying fractional allocations to queries. Panel labels are the queries receiving a 0.90 allocation. The teal dots are counties with the minority group (e.g., White or non-White) comprising less than 10% of the county’s total population. The orange dots are counties with the minority group (e.g., White or non-White) comprising more than 10% of the county’s total population. DAS = Data produced by the Disclosure Avoidance System; IPUMS = Original 1940 full-count data.


Appendix References

Iceland, J. (2004). The multigroup entropy index (also known as Theil’s H or the information theory index). U.S. Census Bureau. https://www2.census.gov/programs-surveys/demo/about/housing-patterns/multigroup_entropy.pdf

Iceland, J., Weinberg, D. H., & Steinmetz, E. (2002). Racial and ethnic residential segregation in the United States: 1980-2000 (CENSR-3; Census 2000 Special Reports, p. 151). U.S. Census Bureau. https://www.census.gov/prod/2002pubs/censr-3.pdf

Massey, D. S., & Denton, N. A. (1988). The dimensions of residential segregation. Social Forces, 67(2), 281–315. https://doi.org/10.2307/2579183

Reardon, S. (2017). A conceptual framework for measuring segregation and its association with population outcomes. In J. M. Oakes & J. S. Kaufman (Eds.), Methods in social epidemiology (2nd ed., pp. 132–157). Jossey-Bass.

Reardon, S. F., & Firebaugh, G. (2002). Measures of multigroup segregation. Sociological Methodology, 32(1), 33–67. https://doi.org/10.1111/1467-9531.00110

Reardon, S. F., & O’Sullivan, D. (2004). Measures of spatial segregation. Sociological Methodology, 34(1), 121–162. https://doi.org/10.1111/j.0081-1750.2004.00150.x 


©2022 Brian Asquith, Brad Hershbein, Tracy Kugler, Shane Reed, Steven Ruggles, Jonathan Schroeder, Steve Yesiltepe, and David Van Riper. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
