
The Effect of Differentially Private Noise Injection on Sampling Efficiency and Funding Allocations: Evidence From the 1940 Census

Published on Jun 24, 2022

Abstract

The Census Bureau has recently released a plan to ensure that all statistics released as part of the 2020 Census are protected by a noise injection algorithm that satisfies the definition of differential privacy. Using 1940 Census data that has been treated with a noise injection process similar to that proposed for the 2020 Census, we explore the utility of the 1940 data by analyzing how the additional noise might affect standard uses of decennial census data. We consider three separate uses of decennial census data: oversampling populations in surveys, screening operations for surveys of rare populations, and allocating federal funds to specific areas. We find that for use cases that involve large populations, the effects of noise injection are relatively modest. Nonetheless, the noise injection can lead to sampling-frame coverage issues for surveys of rare populations and to substantial misallocations of funds to local areas.

Keywords: differential privacy, decennial census, oversampling, funding allocations, probability sample, hard-to-reach population


Media Summary

The Census Bureau recently announced a new method to protect statistics released as part of the 2020 Census. This methodology, based on the principle of ‘differential privacy,’ introduces randomness to reported statistics in order to prevent individual responses from being disclosed. Because this process could reduce data usability, it is important to understand how the new treatment of the data will affect typical uses of census data. We explore the effects of differential privacy using 1940 Census data that have been treated using a methodology similar to the one proposed for the 2020 Census. We consider three common uses of decennial census data: oversampling populations in surveys, screening operations for surveys of rare populations, and allocating federal funds to specific areas. We find that for uses involving large groups of individuals, such as oversampling of African Americans in a nationwide survey, the effects of differential privacy are relatively modest. Nonetheless, when data uses rely on accurate statistics for small groups of individuals, the data become substantially less useful. In particular, we find that the use of differential privacy leads to coverage issues for surveys of rare populations and to substantial misallocations of funds to local areas.


1. Introduction

In response to concerns that computational advances and ever-growing amounts of publicly retrievable data allow outside actors to ‘reidentify’ respondents in standard statistical products, the U.S. Census Bureau has announced that they will use differentially private (DP) noise injection techniques to protect respondent confidentiality in the 2020 Census. These new procedures are designed to provide strong privacy guarantees, and represent a significant departure from traditional statistical disclosure limitation techniques employed by the Census Bureau (Garfinkel et al., 2018; McKenna, 2018).

Because these techniques represent such a large change for the dissemination of decennial census data, they have been met with significant concern. Given that these methods will most likely lead to a loss in data utility, this concern is unsurprising. Nonetheless, how sizable this loss of utility will be is specific to the data set and privacy protection procedure that is used. Therefore, it is important to understand the potential effects of this new procedure for the 2020 Census. Decennial census statistics get used in a multitude of ways, and understanding the impact of DP in a variety of use cases is essential for informing the implementation of DP techniques.

We consider how some common uses of 2020 Census data might be impacted by DP. We do so by comparing uses of 1940 Census data with and without DP (when we began our work, the 1940 Census offered the only data set available on both a raw unadjusted and DP basis). We include three separate use cases, each considered under values of the privacy protection parameter, ϵ, ranging from 0.25 to 8.0 (smaller values provide greater privacy protection but less utility). While our analyses focus on the negative effects of additional noise on data utility, we acknowledge that any additional noise also serves to provide additional privacy to respondents. Combining analyses of utility and privacy effects may help to inform an optimal choice of ϵ.

Our first two use cases come from using census data in survey sampling. Many important government and private sector surveys, including the General Social Survey, Current Population Survey, and National Health Interview Survey, use census data for sampling and weighting (Parsons et al., 2014; Smith et al., 2019; Wolter et al., 2015). First, we consider a hypothetical survey that aims to oversample African Americans. We find that even with strong privacy protection (low values of ϵ), the oversampling process is only modestly less efficient than it would be if it were applied to raw unadjusted census data, and as ϵ increases the results are virtually identical with those arising from use of the unaltered census data. Second, we consider a hypothetical survey in which a brief screening interview is conducted to generate a sample survey of the adult American Indian and Alaskan Native (AIAN) population. We show that using noise-injected data creates inefficiencies in the survey design, with the degree of inefficiency positively associated with the level of privacy protection. Sponsors of surveys would have to pay for the inefficiencies either through increased survey budgets or through reduced sample size and, in turn, decreased precision. In addition to design inefficiency, we find some enumeration districts with nonzero AIAN population according to raw unadjusted census data have zero AIAN population according to DP protected census data. Consequently, given DP, whole segments of the AIAN population may not be subjected to sampling in the survey, which could inject a serious bias into the survey statistics.

Our third use case examines implications of DP for a hypothetical governmental program that distributes funds to geographic areas on the basis of the number of children in the area. Comparing the results of funding allocations corresponding to, respectively, DP protected and raw census data reveals that DP noise injection misallocates funds when ϵ is low and the misallocation decreases as ϵ increases.

There are a few caveats to note about our results. The 1940 Census data contain a relatively limited set of variables, which allowed us to analyze only a few important use cases. Also, the values of the privacy protection parameter, ϵ, applied to the 1940 Census are unlikely to reflect the same amount of noise as will be applied to the 2020 Census, and the direction of this effect is ambiguous. On one hand, improvements and refinements made to the Census Bureau’s DP algorithm will imply that the same ϵ will lead to less noise in 2020. On the other hand, if more statistics are released in 2020 than in the current 1940 data, the same global privacy budget, ϵ, will lead to more noise being added to the data in 2020 when compared to 1940. In addition, the geographies in 1940 do not reflect modern geographic boundaries. While enumeration districts are roughly similar in size to a modern census block group, they are drawn with different boundaries and do not have the same size distribution (we discuss these issues in more detail in Section 2.3). Finally, we note that our analysis only covers the effects of DP on data utility. The introduction of additional noise into the census data decreases data utility (as shown here), but comes with the benefit of providing stronger privacy protections to respondents. The charge of any data provider is to balance the important competing objectives of utility and privacy (Leclerc, 2019).

The rest of the article is structured as follows. Section 2 provides background for our analysis, Section 3 discusses the data sources used in the article, and Sections 4–6 present our methodology and results for each of our three use cases. Section 7 concludes and discusses avenues for future research.

2. Background

2.1. Differential Privacy

As mentioned previously, increasing amounts of publicly available data and growing computational power have made it easier for attackers to make improper use of statistical data, such as identifying respondent information from aggregate anonymized statistics. Given that statistical agencies are legally obligated to ensure that data are used for statistical purposes only, this ‘reidentification’ of respondents is a serious and growing concern. In order to combat these challenges, DP adds a calibrated amount of noise to the data to make it less likely that respondent information can be accurately identified.

As described in Dwork and Roth (2014), the basic intuition behind ϵ-DP is that an attacker’s knowledge about the characteristics of an individual respondent should not improve markedly based on whether or not the individual’s information is included in the DP analysis. Put differently, the output of the DP mechanism is almost equally likely to be generated from a neighboring data set as from the data set used to create the statistic. To define this formally, consider a mechanism that takes as input a data set X and produces an estimate θ, where θ is a random variable conditional on the values of the data. Examples of θ could include a table package, a microdata set, or, in the current context, the full set of tables released as part of the decennial census. The mechanism is ϵ-DP if, for all events in θ and all pairs of data sets x and y that differ by one record:

$$P(\theta \mid X = x) \le e^{\epsilon} \, P(\theta \mid X = y)$$

The value of ϵ controls the amount of noise added, and is set by the data provider to obtain a balance between privacy and data utility (Abowd & Schmutte, 2019). Depending on the value of ϵ, the same DP algorithm can produce results that are either almost entirely noise or essentially identical to those that would be obtained from the original data.
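To make the inequality concrete, the following minimal sketch checks it numerically for a Laplace mechanism applied to a single count. The counts (100 and 101), the grid of candidate outputs, and the function names are illustrative assumptions; this is not the Census Bureau's implementation.

```python
import math

def laplace_pdf(t: float, center: float, scale: float) -> float:
    """Density at t of a Laplace distribution with the given center and scale."""
    return math.exp(-abs(t - center) / scale) / (2.0 * scale)

def max_density_ratio(count_x: int, count_y: int, epsilon: float) -> float:
    """Largest ratio P(theta | X=x) / P(theta | X=y) over a grid of outputs.

    A count changes by at most 1 when one record changes, so the Laplace
    scale that targets epsilon-DP is 1 / epsilon.
    """
    scale = 1.0 / epsilon
    grid = [count_x - 20 + 0.1 * i for i in range(401)]  # hypothetical outputs
    return max(
        laplace_pdf(t, count_x, scale) / laplace_pdf(t, count_y, scale)
        for t in grid
    )

epsilon = 0.5
ratio = max_density_ratio(count_x=100, count_y=101, epsilon=epsilon)
bound = math.exp(epsilon)
print(f"largest observed ratio = {ratio:.4f}, bound e^epsilon = {bound:.4f}")
```

The ratio reaches the bound for outputs on the far side of the true count, which is why the guarantee is stated as an inequality rather than an equality.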

Note also that this definition cannot be satisfied unless the function θ(·) introduces randomness into the output distribution. The definition we present here assumes the raw unadjusted census data are fixed, and the only randomness introduced is by the privacy protection mechanism. This is not necessary for all definitions of DP, however. See Vadhan (2017) or Kifer and Machanavajjhala (2014) for further discussion and examples.

This definition of privacy provides strong privacy guarantees without making assumptions about the knowledge of potential data attackers. In addition, once protected with a DP method, data can be postprocessed in any manner and retain the same level of privacy protection. This allows the suite of methods satisfying this definition to provide flexible solutions for ensuring that data sets do not leak information about respondents. Therefore, these methods have a number of potential applications in federal statistical agencies, as discussed in Reiter (2019) and boyd (2019).

Nonetheless, many observers have raised concerns that this definition may lead to excessive amounts of noise being added to the census data. For example, Ruggles et al. (2019) suggest that the definition of DP is too large a departure from traditional disclosure avoidance methods and is not worth the cost of loss of utility in census data products. In addition, the privacy measure, ϵ, may or may not correspond to actual reidentification risk (McClure & Reiter, 2012). Finally, interpretation of privacy protection may be difficult for relatively large values of ϵ. For example, if ϵ = 0.10, this intuitively represents a roughly 10% increase in the probability of a bad event happening if a research subject chooses to share their data (Wood et al., 2018). However, if ϵ = 4, this relative increase would be a factor of e^4 ≈ 55, or roughly a 55-fold increase in the probability of a bad event. If ϵ = 8, the same factor is e^8 ≈ 3,000.
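The arithmetic behind these interpretations is simply the exponential of ϵ. A short sketch, where the function name `risk_factor` is ours and the ϵ values are those mentioned in the text:

```python
import math

def risk_factor(epsilon: float) -> float:
    """Upper bound e^epsilon on the multiplicative change in an event's probability."""
    return math.exp(epsilon)

for epsilon in (0.10, 0.25, 1.0, 4.0, 8.0):
    print(f"epsilon = {epsilon:<4} -> factor of {risk_factor(epsilon):,.2f}")
```

For small ϵ the factor is close to 1 + ϵ, which is why ϵ = 0.10 reads as roughly a 10% increase, while the factor grows exponentially for larger values.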

2.2. Differentially Private Noise Injection in the 2020 Census

A detailed summary of the 2020 Census DP algorithm similar to the one analyzed in this article is available in Abowd et al. (2019), Leclerc (2019), and Ashmead et al. (2019), but we summarize some important details here. Note that details of the algorithm have continued to evolve since the demonstration data used in this article were produced, and production DP census data could differ from the DP demonstration data used here.

At a high level, the algorithm works by dividing the data up into a number of bins and constructing counts of individuals in each of these bins (e.g., the count of the population in a given census block*age group*race*ethnicity*gender combination). After this, a random draw from a Laplace distribution is added to each cell of data and statistics of interest are computed from the noise-injected data. This additional noise introduces uncertainty into the end statistics that makes it harder for potential attackers to reconstruct confidential data.
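The bin-and-perturb step can be sketched as follows. This is a simplified illustration under stated assumptions: each cell receives independent Laplace noise with scale 1/ϵ, the cell labels and counts are hypothetical, and the budget splitting, invariants, and postprocessing of the production algorithm are omitted.

```python
import random

def laplace_draw(rng: random.Random, scale: float) -> float:
    """Laplace(0, scale) draw, generated as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def inject_noise(cell_counts: dict, epsilon: float, seed: int = 0) -> dict:
    """Add independent Laplace noise with scale 1/epsilon to every cell count."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return {cell: n + laplace_draw(rng, scale) for cell, n in cell_counts.items()}

# Hypothetical cells keyed by (district, race, voting-age) labels.
true_counts = {
    ("ED-001", "White", "18+"): 812,
    ("ED-001", "Black", "18+"): 95,
    ("ED-001", "AIAN", "18+"): 3,
    ("ED-001", "AIAN", "<18"): 0,
}
noisy_counts = inject_noise(true_counts, epsilon=0.25)
for cell, n in true_counts.items():
    print(cell, n, round(noisy_counts[cell], 1))
```

Because the noise scale does not depend on the size of the cell, the same draw is a small relative error for the cell of 812 and a large one for the cell of 3.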

This implementation of DP makes use of the matrix mechanism described in Li et al. (2015), but includes features that are unique to the decennial census context. First, the implementation respects certain ‘invariants,’ which are a group of statistics that are released without any noise injection. Current plans are for state-level population counts as well as housing unit totals and group quarters totals by group quarters type to be invariant. However, raw population counts at low levels of geography will not necessarily reflect true population totals.

In addition, this operationalization of DP is constructed with the Census Bureau’s traditional geographic hierarchy in mind. In other words, noise is added at different geographic levels so that statistics for more aggregated geographic areas contain less noise than statistics from less aggregated geographic areas. This will be important for the results that follow: any use of decennial census data for larger geographies will be much less affected by DP than uses that rely on precise information for smaller geographic levels.

The noise injection procedure also applies a number of ‘postprocessing’ steps to ensure that the DP data satisfy standard hierarchical relationships in decennial census data. These include constraints on the data set so that there can be no negative population counts for any geographies or subgroups, as well as ensuring that the data are constructed so that the sum of smaller cells will reproduce exactly the counts in larger cells.
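The two constraints named above (no negative counts, and child cells summing exactly to their parent) can be illustrated with a deliberately simple stand-in. The text does not specify how the constraints are imposed, so the clip-and-rescale approach with largest-remainder rounding below is our own assumption for illustration, not the Census Bureau's procedure; the function name and inputs are hypothetical.

```python
def postprocess(noisy_children: list, parent_total: int) -> list:
    """Return nonnegative integers that sum exactly to parent_total."""
    clipped = [max(0.0, v) for v in noisy_children]  # remove negative counts
    total = sum(clipped)
    if total == 0:
        # Nothing to scale: spread the parent total evenly across children.
        shares = [parent_total / len(clipped)] * len(clipped)
    else:
        shares = [v * parent_total / total for v in clipped]
    counts = [int(s) for s in shares]  # round every share down
    leftover = parent_total - sum(counts)
    # Largest-remainder rounding: hand leftover units to the largest fractions.
    order = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts

adjusted = postprocess([12.4, -3.1, 40.7, 0.6], parent_total=50)
print(adjusted, sum(adjusted))
```

Even this toy version shows that postprocessing is not neutral: clipping negative values pushes small cells upward, and rescaling then pulls the other cells down to compensate.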

Finally, the amount of noise to be added in this procedure is governed by a ‘global ϵ.’ This global ϵ is then allocated among tables and geographies, with each individual table and geographic level receiving a share of this total budget. How this global ϵ is split is a potentially important policy decision for the Census Bureau.
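As a rough sketch of what splitting a global ϵ means, the example below divides a budget across the four geographic levels present in the 1940 data. The equal shares are purely hypothetical (the actual allocation is a policy decision, as noted above), and we assume the per-level budgets add up to the global budget and that each count has Laplace noise with scale 1/ϵ for its level.

```python
def allocate_budget(global_epsilon: float, shares: dict) -> dict:
    """Split a global epsilon across levels; the shares must sum to 1."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {level: global_epsilon * share for level, share in shares.items()}

# Hypothetical equal split over the four 1940 geographic levels.
shares = {
    "nation": 0.25,
    "state": 0.25,
    "county": 0.25,
    "enumeration district": 0.25,
}
for global_epsilon in (0.25, 8.0):
    per_level = allocate_budget(global_epsilon, shares)
    # Assumed noise scale for a single count at each level: 1 / epsilon_level.
    scales = {level: 1.0 / eps for level, eps in per_level.items()}
    print(global_epsilon, scales)
```

Under these assumptions each level sees only a fraction of the global budget, so the noise on any individual statistic is larger than the headline ϵ alone would suggest.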

2.3. Differentially Private Noise Injection in the 1940 Census Test Data

To enable testing of the effects of DP on utility, the Census Bureau released a DP version of the 1940 Census. The noise-injected files were created using essentially the same procedure outlined in Section 2.2, except for a couple of important changes that influence the conclusions of the analysis. First, the Census Bureau only released three tables of data. In practice, this means that the data are constructed by forming cross-tabulations of group quarters, voting age, race, and ethnicity and then adding noise to these cross-tabulations. While the 1940 DP procedures take similar postprocessing steps to the 2020 DP algorithm in order to account for factors such as invariants and structural zeroes, there are nonetheless a limited number of variables available in these noise-injected files relative to what will be included in the full production of the 2020 Decennial Census.

Second, because census geographic definitions have evolved over time, the geographies of the 1940 Census data are different from current census geographies. While DP algorithms on modern decennial census data will inject noise along the traditional ‘spine’ of census geographies (block, block group, tract, tract group [added to facilitate the 2020 algorithm], county, state, nation), the 1940 Census data only contain enumeration district, county, state, and nation. While the median enumeration district is similar in size to a modern block group (between 600 and 3,000 population), they are not necessarily comparable geographic units. We do our best throughout the analysis to explore the sensitivity of our results to geographies of varying sizes, but future research with more modern data should better address this issue.

Finally, in addition to a relatively limited set of variables and a different geographic hierarchy, it is quite likely that the values of ϵ for 1940 Census data will not lead to the same amount of noise in the data as they will in the 2020 Census. As discussed previously, any changes to the underlying algorithm or to the amount of statistics produced will impact the amount of noise that is effectively added to specific statistics to guarantee the same level of privacy.

3. Data Description and Descriptive Comparison

 We use two separate data sources for our analysis. First, we consider the 1940 Census full count file made available by the Minnesota Population Center (Ruggles et al., 2018). This file contains all the variables present in the 1940 Census for the entire U.S. population, but we restrict our attention to a set of variables that align with those that are present in the noise-injected data:

  • Geography (state, county, enumeration district)

  • Group quarters status

  • Race (White, Black, AIAN, Chinese, Japanese, other Asian/Pacific Islander)

  • Hispanic ethnicity

  • Indicator variable for whether the individual is 18 years or older (i.e., an indicator of voting age)

Our noise-injected files were made available by the Census Bureau using the 1940 Census noise injection algorithm described here and available at https://www2.census.gov/census_1940/. These files are full synthetic population files that are constructed to generate the tables noted in Section 2.3. To permit a range of tests, these files were produced for eight different values of ϵ: 0.25, 0.50, 0.75, 1, 2, 4, 6, and 8. In addition, because the noise injection process includes a random component, the Census Bureau released four separate runs of the file for each value of ϵ. For our use cases, we present analyses across each of these runs in order to illuminate the variability that can be expected in census tabulations as a result of random draws within the same noise injection process.

Before turning to specific use cases, we first provide a descriptive comparison of the noise-injected and original data. As a first pass, we construct cells of race*voting age counts for every enumeration district. To simplify matters, we collapse the racial/ethnic categories into five smaller categories: White non-Hispanic, Black non-Hispanic, Asian non-Hispanic, AIAN non-Hispanic, and Hispanic. This results in population counts for roughly 2.7 million cells of data. In order to provide a sense of the magnitude of noise that is injected into the data with DP, Figure 1 shows, for any non-empty cell in the true data, the percent difference between the DP data and the true data for each of these race*voting age*enumeration district cells across cell size and the value of ϵ. The first row shows the distribution for ϵ = 0.25. For cells of fewer than 100 or of 100–999 individuals, there are quite thick tails in the distribution and many counts are off by more than 35–50%. Indeed, even for large cells of at least 1,000 individuals, a sizeable number of cell counts are off by at least 10%. As ϵ increases, the distribution tightens around zero. For ϵ = 1, the distribution for larger cells of at least 1,000 individuals is quite compact. For ϵ = 8, the distribution is relatively tight around zero for all cell sizes, especially as the true cell count exceeds 100 individuals.

Figure 1. Distribution of percent difference in population counts for race*age*enumeration district cells, run 1. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. The unit of observation is a race*age*enumeration district cell of data. Plots show histograms of the percentage difference ($\frac{\text{DP Count} - \text{True Count}}{\text{True Count}}$) in counts for cells by size of the underlying count and values of ϵ. Note that scales of the y-axes differ between graphs in order to show the shape of the distributions.
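The quantity plotted in Figure 1 can be computed per cell as in the sketch below. The three (true, DP) count pairs are hypothetical, and the bin labels follow the cell-size groupings described in the text.

```python
def percent_difference(dp_count: float, true_count: float) -> float:
    """(DP count - true count) / true count, expressed as a percentage."""
    if true_count == 0:
        raise ValueError("undefined for empty cells, which Figure 1 excludes")
    return 100.0 * (dp_count - true_count) / true_count

def size_bin(true_count: int) -> str:
    """Cell-size grouping by the true count."""
    if true_count < 100:
        return "< 100"
    if true_count < 1000:
        return "100-999"
    return ">= 1,000"

# Hypothetical (true count, DP count) pairs for three cells.
pairs = [(40, 61), (250, 230), (1800, 1795)]
for true_count, dp_count in pairs:
    print(size_bin(true_count), f"{percent_difference(dp_count, true_count):+.1f}%")
```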

While Figure 1 focuses only on cells that are non-empty in the true data, Figure 2 considers the related question of how many cells contain 0 counts in the true data but have nonzero counts in the noise-injected data. Panel A shows the proportion of empty cells in the true data that have nonzero counts in the noise-injected data by ϵ. Interestingly, this number increases as ϵ increases, but nonetheless stays relatively low at less than 5% for all ϵ. Panel B shows that the average noise-injected population count for these nonzero count cells is small. Therefore, for most of these cells for which there are zero population counts in the true data, the population counts in the noise-injected data are either zero or very small across all values of ϵ.

Figure 2. Prevalence and size of cells with nonzero population in noise-injected data and zero population in true data. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. The unit of observation is a race*age*enumeration district cell of data. Panel A shows the fraction of cells that are empty in the true data and have nonzero counts in the DP data. Panel B shows the average count in the DP data for these cells.
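The two statistics in Figure 2 can be expressed compactly. The sketch below uses a handful of made-up cell counts; the function name is ours.

```python
def false_nonzero_summary(true_counts: list, dp_counts: list) -> tuple:
    """Share of truly empty cells that are nonzero in the DP data (Panel A),
    and the mean DP count among those cells (Panel B)."""
    dp_for_empty = [dp for true, dp in zip(true_counts, dp_counts) if true == 0]
    nonzero = [dp for dp in dp_for_empty if dp != 0]
    share = len(nonzero) / len(dp_for_empty) if dp_for_empty else float("nan")
    mean_count = sum(nonzero) / len(nonzero) if nonzero else float("nan")
    return share, mean_count

# Hypothetical cells: five are empty in the true data, one of them is
# nonzero after noise injection.
true_counts = [0, 0, 0, 0, 12, 0, 340]
dp_counts = [0, 2, 0, 0, 9, 0, 351]
share, mean_count = false_nonzero_summary(true_counts, dp_counts)
print(f"share nonzero = {share:.2f}, mean DP count = {mean_count:.1f}")
```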

To provide an overview of how these cell-level population count differences translate to differences at the enumeration district level, Table 1 provides a set of correlation coefficients between total nongroup quarters population counts in the noise-injected files and the unaltered data. Each row presents a set of correlations for a given ϵ and a given run of the noise injection algorithm. The results are very similar to those presented in Figure 1. Overall, the correlation between enumeration district population counts in the DP and the true data is relatively high—at least 0.98 for all ϵ. Nonetheless, when examining enumeration districts with fewer than 100 people, the noise from DP almost entirely obscures the association between the population sizes; the correlation coefficients are below 0.1 for all ϵ.

Table 1. Correlation coefficients between total household population counts in true and noise-injected files: 1940 Census, enumeration district level.

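| ϵ | Run | All | < 100 True Population | 100-999 True Population | At least 1,000 True Population |
|---|---|---|---|---|---|
| 0.25 | 1 | 0.980 | 0.042 | 0.886 | 0.981 |
| 0.25 | 2 | 0.980 | 0.043 | 0.885 | 0.981 |
| 0.25 | 3 | 0.980 | 0.043 | 0.885 | 0.981 |
| 0.25 | 4 | 0.980 | 0.043 | 0.884 | 0.981 |
| 0.50 | 1 | 0.985 | 0.053 | 0.918 | 0.988 |
| 0.50 | 2 | 0.985 | 0.051 | 0.918 | 0.988 |
| 0.50 | 3 | 0.985 | 0.053 | 0.918 | 0.988 |
| 0.50 | 4 | 0.985 | 0.051 | 0.917 | 0.988 |
| 0.75 | 1 | 0.987 | 0.058 | 0.925 | 0.990 |
| 0.75 | 2 | 0.987 | 0.058 | 0.925 | 0.990 |
| 0.75 | 3 | 0.987 | 0.058 | 0.924 | 0.990 |
| 0.75 | 4 | 0.987 | 0.059 | 0.925 | 0.990 |
| 1 | 1 | 0.987 | 0.062 | 0.927 | 0.990 |
| 1 | 2 | 0.987 | 0.062 | 0.927 | 0.990 |
| 1 | 3 | 0.987 | 0.061 | 0.927 | 0.990 |
| 1 | 4 | 0.987 | 0.063 | 0.927 | 0.990 |
| 2 | 1 | 0.987 | 0.068 | 0.930 | 0.991 |
| 2 | 2 | 0.987 | 0.068 | 0.930 | 0.991 |
| 2 | 3 | 0.987 | 0.068 | 0.930 | 0.991 |
| 2 | 4 | 0.987 | 0.069 | 0.930 | 0.991 |
| 4 | 1 | 0.987 | 0.071 | 0.930 | 0.991 |
| 4 | 2 | 0.987 | 0.071 | 0.930 | 0.991 |
| 4 | 3 | 0.987 | 0.071 | 0.930 | 0.991 |
| 4 | 4 | 0.987 | 0.071 | 0.930 | 0.991 |
| 6 | 1 | 0.987 | 0.072 | 0.930 | 0.991 |
| 6 | 2 | 0.987 | 0.072 | 0.930 | 0.991 |
| 6 | 3 | 0.987 | 0.072 | 0.930 | 0.991 |
| 6 | 4 | 0.987 | 0.072 | 0.930 | 0.991 |
| 8 | 1 | 0.987 | 0.072 | 0.930 | 0.991 |
| 8 | 2 | 0.987 | 0.072 | 0.930 | 0.991 |
| 8 | 3 | 0.987 | 0.072 | 0.930 | 0.991 |
| 8 | 4 | 0.987 | 0.072 | 0.930 | 0.991 |
| N Enumeration Districts | | 134,017 | 9,182 | 69,814 | 55,021 |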

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. All statistics are for the nongroup quarters population.
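One reason correlations can be high overall yet low within the smallest size class is that, within a narrow band of true counts, the spread of the true values is small relative to the noise. The simulation below illustrates this mechanism only; the population range, the Laplace scale, and the number of districts are hypothetical and are not calibrated to any ϵ or to the values in Table 1.

```python
import math
import random

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
NOISE_SCALE = 40.0  # hypothetical Laplace scale, not tied to any epsilon

# Simulated districts: true population plus Laplace noise
# (generated as the difference of two exponential draws).
true_pop = [rng.randint(1, 3000) for _ in range(5000)]
dp_pop = [
    t + rng.expovariate(1 / NOISE_SCALE) - rng.expovariate(1 / NOISE_SCALE)
    for t in true_pop
]

def corr_in_range(low: int, high: int) -> float:
    """Correlation among districts with low <= true population < high."""
    pairs = [(t, d) for t, d in zip(true_pop, dp_pop) if low <= t < high]
    return pearson([t for t, _ in pairs], [d for _, d in pairs])

corr_all = corr_in_range(1, 3001)
corr_small = corr_in_range(1, 100)
corr_mid = corr_in_range(100, 1000)
corr_large = corr_in_range(1000, 3001)
print(f"all={corr_all:.3f} <100={corr_small:.3f} "
      f"100-999={corr_mid:.3f} >=1000={corr_large:.3f}")
```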

While raw population counts are important statistics, decennial census data are also used extensively for tracking how communities have grown over time. To assess how DP might affect growth rates, Table 2 presents distributional statistics of differences in the growth rate in county population between 1930 and 1940. These are constructed using the true county population in 1930 and comparing it to either the true population in 1940 or the noise-injected population count in 1940. The results show that growth rates are particularly sensitive to the choice of ϵ and that DP noise injection can create extreme differences between true and noise-injected values. For example, for ϵ = 0.25, 5% of counties experience growth rates that are almost 50 percentage points less than their true growth rate. Again, as ϵ approaches 8, these differences become relatively minimal. Nonetheless, since a given global ϵ is likely to produce more noise in the 2020 Census than in the 1940 Census, these results still raise concerns about the effects of DP noise on the accuracy of county growth statistics.

Table 2. Distribution of differences in population change due to differential privacy, 1930–1940: county level.

 

 

Percentage Change Distribution:

| ϵ | N | 5th Percentile | 25th Percentile | Median | 75th Percentile | 95th Percentile |
|---|---|---|---|---|---|---|
| 0.25 | 3096 | -47.55% | -5.85% | -0.38% | 5.48% | 40.59% |
| 0.50 | 3096 | -21.80% | -2.71% | -0.01% | 3.51% | 25.61% |
| 0.75 | 3096 | -16.19% | -1.97% | -0.09% | 2.04% | 16.54% |
| 1 | 3096 | -12.08% | -1.52% | -0.02% | 1.67% | 13.04% |
| 2 | 3096 | -6.16% | -0.72% | 0.00% | 0.99% | 7.45% |
| 4 | 3096 | -3.39% | -0.42% | 0.00% | 0.50% | 3.94% |
| 6 | 3096 | -2.61% | -0.30% | 0.00% | 0.33% | 2.37% |
| 8 | 3096 | -1.60% | -0.21% | 0.00% | 0.26% | 2.14% |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. The unit of observation is the county. Cells show the distribution of difference in percentage population growth from 1930 to 1940 between DP and true data (DP change–true change). Percentiles represent the percentiles of the distribution of percentage change. Population growth is calculated using true county population in 1930 Census and either DP or true population from 1940 Census data.
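The statistic summarized in Table 2 can be sketched as follows. The five counties and their populations are hypothetical, and the percentile method (inclusive interpolation via the standard library) is our assumption, since the text does not state which method was used.

```python
import statistics

def growth_rate(pop_1930: float, pop_1940: float) -> float:
    """Percentage population change from 1930 to 1940."""
    return 100.0 * (pop_1940 - pop_1930) / pop_1930

def growth_difference(pop_1930: float, true_1940: float, dp_1940: float) -> float:
    """DP growth rate minus true growth rate, in percentage points."""
    return growth_rate(pop_1930, dp_1940) - growth_rate(pop_1930, true_1940)

# Hypothetical counties: (true 1930, true 1940, DP 1940).
counties = [
    (2000, 2200, 2100),
    (15000, 15600, 15630),
    (800, 760, 905),
    (120000, 131000, 130980),
    (450, 500, 390),
]
differences = [growth_difference(*county) for county in counties]

# quantiles(n=20) returns 19 cut points: index 0 is the 5th percentile,
# index 4 the 25th, index 9 the median, index 14 the 75th, index 18 the 95th.
cuts = statistics.quantiles(differences, n=20, method="inclusive")
print([round(cuts[i], 2) for i in (0, 4, 9, 14, 18)])
```

Because both growth rates share the same 1930 base, the difference reduces to the 1940 noise divided by the 1930 population, so small counties are the ones that produce the extreme tails.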

To determine how differences in descriptive statistics might affect other real-world uses of decennial census data, we now run through three separate use cases and analyze how DP data would affect the results. For each use case, we discuss the methodology used, the results, and any specific caveats to keep in mind when attempting to extrapolate the results of this analysis to the 2020 Census setting.

4. Oversampling of African Americans

To produce accurate statistics for both the total population and subgroups of the population, surveys are often designed to oversample the subgroup(s) of analytical interest (Barron et al., 2015; Biemer & Lyberg, 2003; Cochran, 1977; Kalton, 2009; Kalton & Anderson, 1986). Given that oversampling relies on targeting these populations using decennial census data, DP noise injection could potentially make the process of oversampling less efficient. Specifically, the more noise that is added to the data, the less efficient any oversampling strategy will be. In order to estimate how DP will affect this process, we introduce a potential oversampling procedure and then show how the results of this procedure will differ when using noise-injected data. While additional methodologies such as moving to a panel survey, sampling frame enhancements, and expanding use of administrative records may help to ameliorate some of the negative effects shown here, the exact methodological details of these methods are unclear and they would come with their own start-up costs. Therefore, we consider them outside the scope of the current analysis.

4.1. Setup

We consider the specific example of creating an oversample of African Americans in an existing sample survey. First, we construct a sampling frame of specific geographies from across the country to mimic a standard survey frame. In particular, we select 150 geographic areas of more than 500,000 individuals to serve as our primary sampling units. From each geographic area, we draw four enumeration districts for rural areas and eight enumeration districts for urban areas. This yields a survey frame that includes 800 enumeration districts in total. This process creates geographic clustering in the sample, which helps to limit the costs of in-person interviewing. Note that while not shown here, results using the entire nation as a sampling frame are qualitatively similar.

Enumeration districts are classified into two domains based on whether the concentration of African Americans is above a threshold. We set this threshold at 30%, but our results are qualitatively robust to modest changes in this value. Denote these domains as H (high concentration) and L (low concentration), where the total numbers of individuals in H and L are denoted by $N_H$ and $N_L$, respectively. Furthermore, let the total numbers of African American individuals in H and L be denoted by $D_H$ and $D_L$, respectively.

There are two stages to sampling. In the first stage, enumeration districts are selected with probability proportional to the size of the district population. Enumeration districts in H are oversampled, where the oversampling is governed by a parameter b such that the probability that any single district is selected is b times greater for enumeration districts in H than for districts in L. In the second stage of sampling, a fixed number of individuals are randomly selected from each of the enumeration districts selected in the first stage. For concreteness, suppose we design a sample of 1,000 individuals. The fractions of individuals selected for the survey are denoted as follows:

$$
f_L = \frac{1000}{N_L + N_H b} \quad \text{for individuals in districts in } L, \\
f_H = \frac{1000b}{N_L + N_H b} \quad \text{for individuals in districts in } H,
$$

where b is a parameter controlling the degree of oversampling. As b→∞, only individuals in the high-density domain are sampled. In the primary results below, we use b = 2, but results are qualitatively similar for other common values of b.
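A quick numerical check of these sampling fractions, using hypothetical domain sizes, confirms that they yield the intended total sample of 1,000 and that individuals in H are sampled at b times the rate of individuals in L.

```python
def sampling_fractions(n_low: float, n_high: float, b: float,
                       sample_size: int = 1000) -> tuple:
    """Return (f_L, f_H) for domain sizes N_L, N_H and oversampling factor b."""
    denominator = n_low + n_high * b
    return sample_size / denominator, sample_size * b / denominator

# Hypothetical domain sizes; b = 2 as in the primary results.
n_low, n_high = 900_000, 100_000
f_low, f_high = sampling_fractions(n_low, n_high, b=2)
expected_sample = f_low * n_low + f_high * n_high
print(f"f_L = {f_low:.6f}, f_H = {f_high:.6f}, expected sample = {expected_sample:.1f}")
```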

The expected number of African Americans in the resulting 1,000-person sample will be $f_L D_L + f_H D_H$. Because noise injection will be applied to total population counts as well as to the population counts of specific subgroups, both the sampling fractions and the density of African Americans will be altered. If the noise is relatively small, the fraction of African Americans in the survey will be close to what would be obtained if enumeration districts were sampled with probability proportional to the true counts. As the amount of noise grows larger, the fraction of African Americans in the selected sample will converge toward what would be obtained from a stratified random sample with proportional allocation to the high- and low-density strata.
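The expected count and its limiting case can be illustrated with hypothetical domain totals. Setting b = 1 gives proportional allocation, which is the benchmark the sample composition approaches as noise makes the high- and low-density classification uninformative.

```python
def expected_subgroup_count(n_low: float, n_high: float,
                            d_low: float, d_high: float,
                            b: float, sample_size: int = 1000) -> float:
    """Expected subgroup members in the sample: f_L * D_L + f_H * D_H."""
    denominator = n_low + n_high * b
    f_low = sample_size / denominator
    f_high = sample_size * b / denominator
    return f_low * d_low + f_high * d_high

# Hypothetical totals: N_L, N_H are domain populations; D_L, D_H are the
# numbers of African American individuals in each domain.
totals = dict(n_low=900_000, n_high=100_000, d_low=45_000, d_high=60_000)
with_oversampling = expected_subgroup_count(**totals, b=2)
proportional = expected_subgroup_count(**totals, b=1)
print(f"b = 2: {with_oversampling:.1f}   b = 1: {proportional:.1f}")
```

With these made-up totals, oversampling raises the expected subgroup count from 105 to 150 per 1,000 sampled; noise in the district classification would erode that gain toward the b = 1 figure.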

4.2. Results

From our hypothetical sampling frame, we select a sample from the noise-injected data using the methodology described. We then analyze how many individuals in our sample would be African American using the true data, and how many households would need to be sampled in order to ensure that 1,000 African Americans are in the sample. Throughout, we assume that 100% of individuals contacted would complete the survey. Provided that response rates are constant across groups, survey nonresponse should affect sampling using both DP and true data equally.

Table 3a provides statistics on the sample representation using this process. In the first row, we see that with the unaltered data, 20.5% of the sample will be African American. With relatively small ϵ, this number drops slightly to 20.1% for run 1. This is a small change though, and as ϵ increases, the percentage of African Americans in the sample based on DP data appears very similar to what would be obtained from a survey using the unaltered data. Note also that the variation among runs is relatively minor. Even for the most extreme difference when ϵ = 0.25, there is only a 0.26 percentage-point difference in the composition of the survey sample.

Table 3a. Fraction of sample that is African American when oversampling African Americans in a sample survey.

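| | Run 1 | Run 2 | Run 3 | Run 4 |
| --- | --- | --- | --- | --- |
| True Data | 20.53% | | | |
| Noise-Injected Data | | | | |
| ϵ = 0.25 | 20.12% | 20.14% | 20.38% | 20.14% |
| ϵ = 0.50 | 20.31% | 20.41% | 20.32% | 20.37% |
| ϵ = 0.75 | 20.46% | 20.43% | 20.33% | 20.42% |
| ϵ = 1 | 20.42% | 20.50% | 20.41% | 20.44% |
| ϵ = 2 | 20.50% | 20.49% | 20.47% | 20.48% |
| ϵ = 4 | 20.51% | 20.49% | 20.49% | 20.47% |
| ϵ = 6 | 20.50% | 20.51% | 20.50% | 20.50% |
| ϵ = 8 | 20.50% | 20.50% | 20.50% | 20.50% |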

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the fraction of the sample that will be African American. Sample is drawn by sampling enumeration districts with probability proportional to size, and sampling enumeration districts that are at least 30% African American at twice the rate. Of enumeration districts in the sample, 147 of 800 are in the high sampling rate stratum. Response rates across groups assumed to be constant at 100%.

As another way of presenting this information, Table 3b shows the number of completed surveys that would be needed in order to obtain 1,000 African American respondents. This shows the same pattern: there are very small differences for low ϵ, but as ϵ increases the results using the DP data look very similar to those using the true unaltered data. While not shown here, results using an oversampling parameter anywhere in the range of 5–10 are qualitatively similar. Therefore, for a relatively large subgroup such as African Americans, the projected decrease in sampling efficiency of oversampling is likely to be small.

Table 3b. Expected number of completed surveys to achieve 1,000 African American respondents when oversampling African Americans in a sample survey.

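| | Run 1 | Run 2 | Run 3 | Run 4 |
| --- | --- | --- | --- | --- |
| True Data | 4870 | | | |
| Noise-Injected Data | | | | |
| ϵ = 0.25 | 4969 | 4966 | 4907 | 4965 |
| ϵ = 0.50 | 4924 | 4900 | 4922 | 4910 |
| ϵ = 0.75 | 4886 | 4895 | 4919 | 4898 |
| ϵ = 1 | 4897 | 4878 | 4901 | 4894 |
| ϵ = 2 | 4879 | 4882 | 4884 | 4883 |
| ϵ = 4 | 4876 | 4880 | 4879 | 4885 |
| ϵ = 6 | 4879 | 4877 | 4878 | 4878 |
| ϵ = 8 | 4878 | 4878 | 4878 | 4877 |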

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the expected number of completed surveys needed to obtain 1,000 African American respondents. Sample is drawn by sampling enumeration districts with probability proportional to size, and sampling enumeration districts that are at least 30% African American at twice the rate. Of enumeration districts in the sample, 147 of 800 are in the high sampling rate stratum. Response rates across groups assumed to be constant at 100%.
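The link between the two tables is simple arithmetic: if a fraction p of completed surveys comes from the target group, roughly 1,000 / p completes are needed to reach 1,000 target respondents. The sketch below is an illustration under that assumption, with a hypothetical function name, using the rounded percentages published in Table 3a; because those inputs are rounded, the results can differ from Table 3b by about one survey.

```python
def completes_needed(subgroup_share, target=1000):
    """Expected completed surveys needed to reach `target` subgroup respondents."""
    if not 0 < subgroup_share <= 1:
        raise ValueError("subgroup_share must be in (0, 1]")
    return target / subgroup_share


# Shares taken from Table 3a: true data, and run 1 at epsilon = 0.25.
true_completes = completes_needed(0.2053)
dp_completes = completes_needed(0.2012)

# Relative increase in required completes when the frame uses the noisy data.
extra_pct = 100 * (dp_completes / true_completes - 1)
print(round(true_completes), round(dp_completes), round(extra_pct, 1))
```

With these inputs the extra fieldwork at ϵ = 0.25 is on the order of 2%, which is the sense in which the efficiency loss for a large subgroup is small.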

5. Conducting a Survey of the Adult AIAN Population

If a subgroup is rare enough, oversampling the group as part of a general population survey may be cost prohibitive. A more cost-effective alternative may be to conduct a survey that screens individuals in order to create a sample of only individuals from the rare subgroup. To make this screening operation as efficient as possible, a sample might be designed with a heavy oversample of areas with high concentrations of the rare subgroup, and, as with the oversampling case in Section 4, DP noise injection would lead to a loss of efficiency in this process. Below, we discuss the mechanisms of how this process works and then provide evidence on the potential effects of DP noise injection in the case of surveying the adult AIAN population.

5.1. Setup

Consider the population of adult AIAN individuals. Similar to our first use case, denote the entire adult population size as $N^a$, which is divided into two sampling strata of sizes $N_L^a$ and $N_H^a$, such that $N_L^a + N_H^a = N^a$. Stratum L is the low-density stratum consisting of tracts in which less than 30% of the individuals are adult and identify as AIAN, and stratum H consists of tracts with at least a 30% adult AIAN population. An equal number of screener interviews are then conducted for all sampled enumeration districts regardless of stratum.

Let $f_H^a = n_H^a / N_H^a$ be the sampling fraction for adults in stratum H, and let $f_L^a = f_H^a / b^a$ be the sampling fraction in stratum L, where $b^a$ controls the degree to which the sample is tilted toward the high-density stratum. As $b^a$ increases, the sample becomes more heavily represented by individuals in the high-density stratum. This leads to increased efficiency for the screening operation (i.e., reduced cost per completed interview of AIAN adults) but increased sampling variance for estimators of survey parameters of interest. Given proper weighting of the survey data, no bias is introduced by the oversampling procedure. While some of this lack of representativeness could be corrected for using survey weights, this will only correct for observed characteristics that are used to form the weights. Any differences between the strata due to unobserved characteristics not captured by survey weights may still lead to decreases in representativeness.

If we wish to obtain a sample of 1,000 AIAN adults, then we will need to draw a sample of the following size:

$$1000 \, \frac{N_L^a + N_H^a b^a}{d_L^a N_L^a + d_H^a N_H^a b^a}$$

where $d_L^a$ is the fraction of adults in the low-density stratum who identify as AIAN and $d_H^a$ is the fraction in the high-density stratum who are AIAN. If, for example, all individuals in the survey frame are AIAN adults, then $d_L^a = d_H^a = 1$ and 1,000 screeners would be needed to obtain the 1,000-individual sample. As the density of the target subgroup decreases, more screeners will be needed to achieve a given sample size. For the purposes of this exercise, we assume that any area with a known 0% density of AIAN adult individuals is excluded from consideration. These areas could be easily added to the sampling frame for the purposes of this exercise, but for the examples shown here this results in a prohibitive increase in costs due to lower sampling efficiency. Therefore, we do not consider this further.
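To make the screener formula concrete, the sketch below evaluates it for a hypothetical frame. The function name and all input values are illustrative assumptions, not figures from the 1940 data; the arguments mirror the symbols above (stratum sizes, stratum densities, and the oversampling parameter).

```python
def screeners_needed(n_low, n_high, d_low, d_high, b, target=1000):
    """Expected screeners needed to obtain `target` members of the rare group.

    n_low, n_high: adult population in the low- and high-density strata
    d_low, d_high: fraction of adults in each stratum who are in the rare group
    b:             oversampling parameter for the high-density stratum
    """
    denom = d_low * n_low + d_high * n_high * b
    if denom <= 0:
        raise ValueError("the frame contains no members of the target group")
    return target * (n_low + n_high * b) / denom


# Hypothetical frame: 900,000 adults in L at 5% density, 100,000 in H at 60%.
with_oversampling = screeners_needed(900_000, 100_000, 0.05, 0.60, b=5)
without_oversampling = screeners_needed(900_000, 100_000, 0.05, 0.60, b=1)

# Sanity check from the text: if everyone in the frame is in the target group,
# exactly 1,000 screeners are needed.
all_target = screeners_needed(900_000, 100_000, 1.0, 1.0, b=5)
print(round(with_oversampling), round(without_oversampling), round(all_target))
```

In this toy frame, tilting the sample toward the high-density stratum with b = 5 cuts the required screeners from roughly 9,500 to roughly 4,100, which is the efficiency gain that misclassified strata or misstated densities would erode.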

With DP data, both the strata definitions and the density of AIAN adults in each stratum will be altered from their true values. Therefore, the noise injected into the data by DP could lead to decreases in efficiency. In addition, the results will show a loss in coverage for certain segments of the population if the noise induced by DP leads enumeration districts that contain AIAN adults to appear as having no AIAN adult population.

Note also that in the results, the expected number of completes needed to obtain a sample of a given size will not be comparable to those for the African American oversampling result in Section 4. This is due to the fact that the concentration of these populations differs as well as the fact that we use an oversampling parameter (b) of 2 in Section 4 and an oversampling parameter of 5 here in Section 5.

5.2. Results

Using all enumeration districts in the United States in 1940 as the sampling frame, we calculate the resulting fraction of the sample that will be adult AIAN individuals and use this to estimate how many screeners will be needed to achieve a sample of 1,000 adult AIAN individuals. As before, we assume that there is perfect response given that any survey nonresponse will affect the operations using DP data or true data equally.

Table 4 presents the results of this exercise. Each cell contains the number of screeners needed in expectation to achieve a sample of 1,000 adult AIAN individuals for a specific value of ϵ and run of the DP noise algorithm. Given how rare this population is, it would take significant effort to achieve this sample: even with the true data, it would require 2,691 screeners. The results using the DP data show modest increases in the number of screeners for low values of ϵ. For example, when ϵ = 0.25, 3,172–3,297 screeners are needed to obtain the 1,000-individual sample. As ϵ increases, the results are almost identical to those using the true data. Across all values of ϵ, the results are relatively stable across all runs of the DP algorithm.

Table 4. Number of screeners required to achieve a 1,000-person adult American Indian and Alaskan Native (AIAN) sample.

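| | Run 1 | Run 2 | Run 3 | Run 4 |
| --- | --- | --- | --- | --- |
| True Data | 2691 | | | |
| Noise-Injected Data | | | | |
| ϵ = 0.25 | 3172 | 3178 | 3297 | 3180 |
| ϵ = 0.50 | 2896 | 2914 | 2908 | 2911 |
| ϵ = 0.75 | 2817 | 2812 | 2826 | 2830 |
| ϵ = 1 | 2788 | 2786 | 2785 | 2792 |
| ϵ = 2 | 2745 | 2741 | 2742 | 2743 |
| ϵ = 4 | 2725 | 2718 | 2722 | 2722 |
| ϵ = 6 | 2719 | 2717 | 2716 | 2716 |
| ϵ = 8 | 2714 | 2712 | 2712 | 2714 |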

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of screeners to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate.

To translate these results to costs, we need to make an assumption about the cost of a screener interview relative to a full survey interview. Given that relative costs could vary from survey to survey, we analyze a variety of cost ratios. We simplify slightly by abstracting from the fixed costs of running a survey operation and assuming that all costs are either due to screener interviews or survey interviews. Incorporating fixed costs would lead to smaller impacts of DP noise on survey costs.

Table 5 shows the relative increase in costs based on assumptions that the full survey interview is the same cost, twice as expensive, 5 times as expensive, or 10 times as expensive as a screener interview. Clearly, as the cost of a full interview relative to a screener interview increases, the effect of DP noise injection on costs decreases. For ϵ = 0.25, the increase in costs depends strongly on the assumptions made about interview cost structures. With relatively more expensive screener interviews, survey costs increase by more than 10%. As the screeners become relatively less expensive, the increase in costs is more modest. For values of ϵ above 1, the increase in costs is modest regardless of the assumption about cost structure. In theory, these results could differ by both the degree of oversampling and the decision of where to split the strata. While not reported here, our results are robust to changes in both the oversampling parameter and the choice of where to divide the enumeration districts into high- and low-density strata.

Table 5. Increase in survey cost due to differentially private (DP) noise injection when surveying the adult American Indian and Alaskan Native (AIAN) population.

| Cost of Interview Relative to Screener | 1x (same cost) | 2x | 5x | 10x |
| --- | --- | --- | --- | --- |
| ϵ = 0.25 | 13.02% | 10.25% | 6.25% | 3.79% |
| ϵ = 0.50 | 5.56% | 4.37% | 2.67% | 1.62% |
| ϵ = 0.75 | 3.42% | 2.69% | 1.64% | 0.99% |
| ϵ = 1 | 2.63% | 2.07% | 1.26% | 0.77% |
| ϵ = 2 | 1.45% | 1.14% | 0.70% | 0.42% |
| ϵ = 4 | 0.92% | 0.72% | 0.44% | 0.27% |
| ϵ = 6 | 0.75% | 0.59% | 0.36% | 0.22% |
| ϵ = 8 | 0.62% | 0.49% | 0.30% | 0.18% |

Note. From 1940 Census demonstration DP data and IPUMS-USA 1940 full count data. Cells contain the percent increase in cost to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate. Columns differentiate the cost of a full interview relative to a theoretically shorter screener interview.
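The cost translation can be sketched as follows. The exact formula is not written out in the text, so this is an assumption: total cost is taken to be the number of screeners plus 1,000 full interviews priced at a multiple of the screener cost, with no fixed costs. The function name is hypothetical. Fed with the run 1 screener counts from Table 4, this calculation appears to reproduce the Table 5 entries to within rounding.

```python
def cost_increase_pct(screeners_dp, screeners_true, interview_cost_ratio,
                      completes=1000):
    """Percent increase in survey cost from using noisy rather than true data.

    Costs are measured in units of one screener interview; each of the
    `completes` full interviews costs `interview_cost_ratio` screener units.
    Fixed costs are ignored (an assumption, matching the text's simplification).
    """
    cost_dp = screeners_dp + completes * interview_cost_ratio
    cost_true = screeners_true + completes * interview_cost_ratio
    return 100 * (cost_dp / cost_true - 1)


TRUE_SCREENERS = 2691                                          # Table 4, true data
RUN1_SCREENERS = {0.25: 3172, 0.50: 2896, 1: 2788, 8: 2714}    # Table 4, run 1

results = {
    eps: [cost_increase_pct(s, TRUE_SCREENERS, ratio) for ratio in (1, 2, 5, 10)]
    for eps, s in RUN1_SCREENERS.items()
}
for eps, row in results.items():
    print(eps, [round(x, 2) for x in row])
```

Because the full-interview cost enters both the numerator and the denominator, a higher interview-to-screener cost ratio dilutes the effect of the extra screeners, which is the pattern across the columns of Table 5.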

Another way to visualize this phenomenon is to examine the relationship between true and DP percent AIAN across enumeration districts. Figure 3 presents scatterplots of the relationship between true percent AIAN and the percent AIAN in the noise-injected data, with each dot weighted by the size of the enumeration district. We can see that with large ϵ, the dots become clustered around the 45-degree line, representing much less noise being added to the data. However, for small values of ϵ there is significant dispersion around the 45-degree line and many enumeration districts experience large percentage changes in their AIAN population due to the addition of noise.

Figure 3. Differences in percent American Indian and Alaskan Native (AIAN) by epsilon and strata. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. All figures are from run 1 of the differential privacy algorithm. The unit of observation is the enumeration district.

In order to assess whether these results are driven by enumeration districts with very rare populations, Tables 6a and 6b present the same results as Table 4 but restricted to districts of at least 100 or at least 1,000 total population, respectively. Specifically, for both the true and noise-injected data, we consider sampling only enumeration districts of a given minimum size. While the results in Table 6a for enumeration districts of at least 100 population are very similar to those in Table 4, the results in Table 6b show that when restricting only to enumeration districts of at least 1,000 population, the screening operation is about equally efficient for both true and noise-injected data. Therefore, if only geographic areas of at least 1,000 persons were available for sampling in this data, the noise injection would have little influence. This may be important in many modern settings, where slightly larger geographic units such as census tracts are often used for sampling purposes. However, if survey designers need to access data for smaller geographic areas, then there likely would be a loss in efficiency.

Table 6a. Number of screeners to obtain 1,000-person adult American Indian and Alaskan Native (AIAN) sample, restricted to enumeration districts of at least 100 population.

| | Run 1 | Run 2 | Run 3 | Run 4 |
| --- | --- | --- | --- | --- |
| True Data | 2712 | | | |
| Noise-Injected Data | | | | |
| ϵ = 0.25 | 3065 | 3094 | 3238 | 3107 |
| ϵ = 0.50 | 2847 | 2846 | 2849 | 2835 |
| ϵ = 0.75 | 2782 | 2768 | 2783 | 2797 |
| ϵ = 1 | 2762 | 2766 | 2756 | 2765 |
| ϵ = 2 | 2739 | 2738 | 2736 | 2738 |
| ϵ = 4 | 2733 | 2727 | 2730 | 2733 |
| ϵ = 6 | 2733 | 2728 | 2731 | 2728 |
| ϵ = 8 | 2729 | 2727 | 2727 | 2730 |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of screeners to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate. Results for both the true data and the DP data are restricted to enumeration districts of at least 100 population, which represent 93.1% of all enumeration districts.

Table 6b. Number of screeners to obtain 1,000-person adult American Indian and Alaskan Native (AIAN) sample, restricted to enumeration districts of at least 1,000 population.

| | Run 1 | Run 2 | Run 3 | Run 4 |
| --- | --- | --- | --- | --- |
| True Data | 3011 | | | |
| Noise-Injected Data | | | | |
| ϵ = 0.25 | 3016 | 3085 | 3160 | 3116 |
| ϵ = 0.50 | 3027 | 3021 | 2980 | 3065 |
| ϵ = 0.75 | 3007 | 3008 | 3035 | 3003 |
| ϵ = 1 | 2960 | 3014 | 3030 | 3015 |
| ϵ = 2 | 3041 | 3016 | 2997 | 3023 |
| ϵ = 4 | 3014 | 3020 | 3006 | 3018 |
| ϵ = 6 | 3024 | 3011 | 3027 | 3019 |
| ϵ = 8 | 3040 | 3020 | 3025 | 3023 |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of screeners to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate.

All in all, the effects of DP noise injection on the efficiency of screeners are relatively modest. While the projected cost increase for ϵ = 0.25 is large, the noise from DP does not have a large effect on screening costs for values of ϵ above 1. We now show that this is a product of a lack of coverage for the sampling frame for small values of ϵ. In particular, areas with adult AIAN individuals are not being represented in the survey sampling frame because the noise from DP makes the area appear to have 0 adult AIAN population. To assess this lack of coverage, Table 7 presents results on the fraction of the adult AIAN population that is covered by the sampling frame for different values of ϵ and runs of the DP algorithm. With small values of ϵ, there is significant undercoverage of the survey frame and roughly 30% of adult AIAN individuals are not represented. Thus, there is potential for important bias in estimators of population parameters of interest. Given appropriate control totals, calibration estimators may be effective in limiting the bias due to undercoverage.

Table 7. Coverage of American Indian and Alaskan Native (AIAN) population using noise-injected data.

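| | Run 1 | Run 2 | Run 3 | Run 4 |
| --- | --- | --- | --- | --- |
| True Data | 100% | | | |
| Noise-Injected Data | | | | |
| ϵ = 0.25 | 70.64% | 70.78% | 68.85% | 70.52% |
| ϵ = 0.50 | 81.92% | 81.54% | 81.71% | 81.50% |
| ϵ = 0.75 | 86.57% | 86.67% | 86.48% | 86.63% |
| ϵ = 1 | 88.90% | 88.96% | 88.45% | 88.60% |
| ϵ = 2 | 93.11% | 92.91% | 92.88% | 93.03% |
| ϵ = 4 | 95.83% | 95.80% | 95.76% | 95.84% |
| ϵ = 6 | 97.08% | 97.05% | 96.78% | 97.07% |
| ϵ = 8 | 97.72% | 97.71% | 97.80% | 97.64% |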

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Figures in cells refer to the percent of the adult AIAN population in the true data that resides in an enumeration district with a nonzero count of AIAN adults in the noise-injected data, that is, a district that remains in the sampling frame. Enumeration districts are sampled with probability proportional to size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate.
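The coverage measure can be sketched in a few lines, assuming coverage is defined as the share of the true target population living in districts that still show a nonzero target count in the noise-injected data. The function name and the district counts are hypothetical and purely illustrative.

```python
def frame_coverage(true_counts, noisy_counts):
    """Share of the true target population in districts that stay in the frame.

    A district stays in the frame only if its noisy target count is positive.
    """
    if len(true_counts) != len(noisy_counts):
        raise ValueError("count lists must have the same length")
    total = sum(true_counts)
    if total == 0:
        raise ValueError("true target population is empty")
    covered = sum(t for t, n in zip(true_counts, noisy_counts) if n > 0)
    return covered / total


# Hypothetical adult AIAN counts for six enumeration districts.
true_aian = [120, 3, 0, 7, 45, 1]
noisy_aian = [118, 0, 2, 9, 44, 0]   # two small districts are zeroed out by noise

coverage = frame_coverage(true_aian, noisy_aian)
print(f"{100 * coverage:.1f}% of the true population is covered")
```

In this toy example only districts with very few target-group members drop out, so coverage stays high; coverage falls sharply when much of the target population lives in districts with small counts, where noise is most likely to produce a reported zero.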

This undercoverage becomes smaller as ϵ increases, but even modest values such as ϵ = 1 have over 10% undercoverage. Nonetheless, when ϵ = 8, the undercoverage is only a couple of percentage points. Given that the sampling process is based on averages across areas, it is likely that the undercoverage issues we are seeing are largely due to the noise injection per se, and not additional pieces of the DP algorithm such as postprocessing or invariants. While the inclusion of these additional features is likely to have some effect on the types of areas with more accurate data, they are likely a second-order issue in the context of the tract-level data being used to drive sampling decisions.

Note that in theory another solution to undercoverage would be to add a separate stratum to the sampling design for enumeration districts with zero reported adult AIAN population. This method would potentially eliminate bias yet result in an increase, possibly a very large increase, in the cost of the survey, which may be practically infeasible. Given fixed cost, the method would increase the variance and diminish the precision of survey estimators. Still another solution may involve a dual-frame sampling design in which the second sampling frame is a list rich in AIAN adults obtained through administrative sources (records).

6. Allocating Funding on the Basis of Children Under the Age of 18

The federal government distributed about $721 billion (about 16% of its budget) to states and localities in fiscal year 2019, providing about one-quarter of these governments’ total revenues. About 61% of those funds were dedicated to health care, 16% to income security programs, and 9% each to transportation and education, training, employment, and social services (Tax Policy Center, 2020). The distribution was based, in part, on decennial census population counts and related data obtained from the American Community Survey and other Census Bureau programs. Further, the distribution relied on data collected in various large government surveys, whose sampling designs relied on measures of size derived from the decennial census. Given the amounts of money involved, the importance of accurate census data is undeniable. Thus, it is important to reach an understanding of how DP noise injection might distort such fund allocations in future years.

Toward this end, we consider a hypothetical program that allocates funding to areas on the basis of the number of children under 18 years residing in the area (excluding group quarters). We take a full budget of $5 billion to be allocated across areas in proportion to the number of children in an area, which corresponds roughly to $125 a child. We then calculate the allocated per-child funding in both the DP and true data. We do this separately for both enumeration districts and counties. This analysis relates to the prior literature on the effects of error in population estimates on funding allocations, but is more specific in that it considers the effects of DP and uses 1940 data as a test case (National Research Council, 2003; Spencer, 1980; Spencer et al., 2017; Zaslavsky & Schirm, 2002).
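One way to read this calculation is sketched below. It is an assumption about the mechanics rather than the authors' code: each area receives a share of the budget proportional to its child count, the allocation is computed once with true counts and once with noisy counts, and the difference is divided by the area's true number of children. The function name and the four-area example are hypothetical.

```python
def per_child_misallocation(true_children, noisy_children, budget):
    """Per-child funding difference (noisy-data allocation minus true-data allocation).

    Returns one value per area; areas with no children in the true data get
    None because a per-child figure is undefined there.
    """
    total_true = sum(true_children)
    total_noisy = sum(noisy_children)
    gaps = []
    for t, n in zip(true_children, noisy_children):
        if t == 0:
            gaps.append(None)
            continue
        alloc_true = budget * t / total_true
        alloc_noisy = budget * n / total_noisy
        gaps.append((alloc_noisy - alloc_true) / t)
    return gaps


# Hypothetical areas; the budget is set so that true funding is $125 per child.
true_children = [400, 250, 40, 10]
noisy_children = [404, 246, 50, 0]
budget = 125 * sum(true_children)

gaps = per_child_misallocation(true_children, noisy_children, budget)
print([round(g, 2) for g in gaps])
```

Under this reading, an area whose noisy child count is zero loses its entire allocation, a per-child shortfall of $125, and the same absolute noise produces a much larger per-child error in a small area than in a large one.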

Table 8a shows the distribution, across counties, of misallocated dollars per child. When restricting our attention to counties, the misallocation is relatively modest. Even for ϵ = 0.25, the 10th and 90th percentiles are roughly $3 per child in either direction. While this is modest as a percentage of total funding, it may still be large enough to cause concerns for districts that depend on the funds.

Table 8a. Per-child misallocation of funding in hypothetical funding formula due to noise injection, county level.

| | Percentile 10 | Percentile 25 | Percentile 50 | Percentile 75 | Percentile 90 | Standard Deviation |
| --- | --- | --- | --- | --- | --- | --- |
| ϵ = 0.25 | -3.29 | -1.76 | -0.56 | 0.85 | 3.09 | 19.47 |
| ϵ = 0.50 | -2.19 | -1.35 | -0.61 | 0.32 | 1.70 | 12.45 |
| ϵ = 0.75 | -1.91 | -1.24 | -0.69 | 0.08 | 1.26 | 3.73 |
| ϵ = 1 | -1.71 | -1.19 | -0.71 | -0.03 | 1.08 | 7.29 |
| ϵ = 2 | -1.44 | -1.13 | -0.79 | -0.22 | 0.78 | 13.87 |
| ϵ = 4 | | -1.09 | -0.82 | -0.33 | 0.61 | 3.81 |
| ϵ = 6 | -1.29 | -1.10 | -0.85 | -0.36 | 0.58 | 11.65 |
| ϵ = 8 | -1.29 | -1.10 | -0.86 | -0.38 | 0.62 | 2.27 |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per-child misallocation (funding using DP data minus funding using true data), where percentile misallocation refers to the percentile of the empirical distribution of per-child misallocation. Overall per-child funding ≈ $125.

Table 8b. Per-child misallocation of funding in hypothetical funding formula due to noise injection, enumeration district level.

| | Percentile 10 | Percentile 25 | Percentile 50 | Percentile 75 | Percentile 90 | Standard Deviation |
| --- | --- | --- | --- | --- | --- | --- |
| ϵ = 0.25 | -57.19 | -21.15 | -1.91 | 17.91 | 55.83 | 577.48 |
| ϵ = 0.50 | -28.97 | -10.91 | -1.02 | 9.74 | 30.79 | 484.22 |
| ϵ = 0.75 | -19.76 | -7.53 | -0.77 | 6.68 | 21.30 | 432.59 |
| ϵ = 1 | -15.12 | -5.90 | -0.73 | 5.16 | 16.95 | 417.06 |
| ϵ = 2 | -8.18 | -3.43 | -0.69 | 2.67 | 9.16 | 408.98 |
| ϵ = 4 | -4.71 | -2.23 | -0.73 | 1.32 | 5.31 | 411.04 |
| ϵ = 6 | -3.52 | -1.84 | -0.78 | 0.83 | 3.95 | 414.30 |
| ϵ = 8 | -3.50 | -1.64 | -0.82 | 0.58 | 3.88 | 412.44 |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per-child misallocation (funding using DP data minus funding using true data), where percentile misallocation refers to the percentile of the empirical distribution of per-child misallocation. Overall per-child funding ≈ $125.

Table 8b presents the results for enumeration districts, which show that the misallocation could be quite extreme. In fact, for ϵ = 0.25, the 10th and 90th percentiles of per-child misallocation are over $50 in absolute value, which is almost half of the original allocation. As another perspective, Figure 4 presents the full distribution of differences in allocation for various levels of ϵ. For low values of ϵ, there is a large mass at -$125, indicating enumeration districts that theoretically would lose all of their funding due to DP noise. As ϵ increases, the distribution tightens around 0 and the misallocation becomes less pronounced. Nonetheless, for many values of ϵ, the noise induced by DP leads to large misallocations of funds. Note that there is also a slight downward shift in the distribution of funding differences, as the median funding difference is negative. This result is an artifact of the postprocessing, which restricts population counts to be nonnegative while also holding state-level population invariant. While these effects are noteworthy, on the whole the primary patterns in the results are driven by the noise injection procedure itself and not postprocessing.

Figure 4. Differences in funding allocation for hypothetical formula. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. The unit of observation is the enumeration district. Note that scales of the y-axes differ between graphs in order to show the shape of the distributions.
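The negative median attributed to postprocessing can be illustrated with a toy simulation. This is a deliberately simplified stand-in, not the Census Bureau's algorithm: it assumes independent Laplace noise with scale 1/ϵ, clamps negative counts to zero, and then rescales all counts proportionally so the total matches the true total. The function names, the district mix, and the seed are all hypothetical.

```python
import random
import statistics


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_postprocessed(true_counts, epsilon, rng):
    """Add Laplace noise, clamp at zero, then rescale to the true total.

    Assumption: proportional rescaling stands in for the invariant constraint.
    """
    clamped = [max(0.0, c + laplace(1 / epsilon, rng)) for c in true_counts]
    scale = sum(true_counts) / sum(clamped)
    return [scale * x for x in clamped]


rng = random.Random(0)
true_counts = [2] * 5000 + [200] * 5000   # toy mix of small and large districts
released = noisy_postprocessed(true_counts, epsilon=0.25, rng=rng)

diffs = [r - t for r, t in zip(released, true_counts)]
mean_diff = sum(diffs) / len(diffs)       # ~0: the total is held invariant
median_diff = statistics.median(diffs)    # negative: most districts lose a little
print(round(mean_diff, 6), median_diff < 0)
```

Clamping raises the counts of small districts that would otherwise go negative, and holding the total fixed then pulls every other district down slightly, so the mean difference is zero while the median is below zero.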

Given that these results could be driven by enumeration districts with very small populations, Tables 9a and 9b provide the same results as Table 8b but restricted to enumeration districts of at least 100 or at least 1,000 individuals, respectively. While low values of ϵ still lead to pronounced misallocations, the results show that there is less misallocation when restricting to these larger enumeration districts. This highlights the fact that the DP procedure produces larger amounts of error for smaller areas precisely because it is these smaller areas that pose more privacy risk.

Table 9a. Per-child misallocation of funding in hypothetical funding formula due to noise injection, restricted to enumeration districts of at least 100 population.

| | Percentile 10 | Percentile 25 | Percentile 50 | Percentile 75 | Percentile 90 | Standard Deviation |
| --- | --- | --- | --- | --- | --- | --- |
| ϵ = 0.25 | -48.33 | -19.51 | -1.83 | 16.58 | 47.58 | 182.82 |
| ϵ = 0.50 | -24.76 | -10.23 | -1.01 | 8.90 | 25.65 | 134.59 |
| ϵ = 0.75 | -16.89 | -7.07 | -0.76 | 6.11 | 17.71 | 101.71 |
| ϵ = 1 | -13.15 | -5.58 | -0.75 | 4.63 | 13.86 | 100.14 |
| ϵ = 2 | -7.20 | -3.29 | -0.70 | 2.38 | 7.48 | 86.89 |
| ϵ = 4 | -4.22 | -2.17 | -0.73 | 1.14 | 4.28 | 83.89 |
| ϵ = 6 | -3.21 | -1.81 | -0.77 | 0.71 | 3.18 | 83.18 |
| ϵ = 8 | -2.72 | -1.62 | -0.81 | 0.48 | 2.63 | 83.27 |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per-child misallocation (funding using DP data minus funding using true data). Overall per-child funding ≈ $125.

Table 9b. Per-child misallocation of funding in hypothetical funding formula due to noise injection, restricted to enumeration districts of at least 1,000 population.

| | Percentile 10 | Percentile 25 | Percentile 50 | Percentile 75 | Percentile 90 | Standard Deviation |
| --- | --- | --- | --- | --- | --- | --- |
| ϵ = 0.25 | -25.61 | -12.47 | -1.65 | 9.67 | 24.51 | 84.50 |
| ϵ = 0.50 | -13.55 | -6.75 | -1.01 | 5.14 | 13.04 | 63.19 |
| ϵ = 0.75 | -9.43 | -4.75 | -0.76 | 3.55 | 9.22 | 50.99 |
| ϵ = 1 | -7.41 | -3.84 | -0.77 | 2.65 | 7.07 | 61.86 |
| ϵ = 2 | -4.23 | -2.38 | -0.69 | 1.36 | 4.07 | 31.97 |
| ϵ = 4 | -2.71 | -1.69 | -0.69 | 0.64 | 2.54 | 15.25 |
| ϵ = 6 | -2.20 | -1.51 | -0.72 | 0.41 | 2.14 | 11.04 |
| ϵ = 8 | -1.93 | -1.29 | -0.75 | 0.29 | 1.93 | 11.21 |

Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per-child misallocation (funding using DP data minus funding using true data). Overall per-child funding ≈ $125.

7. Conclusion

While DP noise injection algorithms provide strong guarantees for protecting respondent privacy, they potentially diminish the utility of census data that have been collected at the cost of roughly $10 billion. We have analyzed the lost utility in terms of three separate use cases: 1) building an oversample of African Americans in a general population survey, 2) designing a survey with a screening instrument to interview only AIAN adults, and 3) allocating funding to local areas on the basis of number of children in an area. Because the effect of DP noise on data utility can vary widely depending on the particular use case and DP noise algorithm, it is important to analyze the effects of noise injection on data utility across a variety of scenarios.

Our results show that the effect of noise varies widely by use case and value of the privacy protection parameter, ϵ. When looking at statistics for relatively large groups (such as the use case of oversampling African Americans), the effect of DP noise injection is relatively modest. Nonetheless, uses of decennial census data that depend on accuracy for rare demographic groups or small geographic areas can be significantly affected by the additional noise. In particular, we find that unless ϵ is large, for a survey of the adult AIAN population DP noise injection can lead to both a large loss of survey coverage as well as moderate cost increases to achieve a constant sample size. Moreover, our results indicate that small values of ϵ can also lead to substantial misallocations of funding for small geographic areas. As either ϵ or population size increases, these misallocations due to DP noise become more modest.

It is also worth noting that for the 2020 Census, the privacy budget, ϵ, will be spread across more tables of data and more geographies than was done for the 1940 data. Therefore, the values of ϵ considered in our analysis of 1940 data are unlikely to translate to similar losses of data utility in the 2020 Census. Whether this results in more or less data utility for a given level of privacy is not yet known. On one hand, because the privacy budget must be spread across many more tables in the 2020 Census, the amount of noise added to the data for a given ϵ may be much larger in 2020. In addition, the 1940 Census data we analyze here contains a limited set of variables and much different geographic definitions than will be released in 2020. On the other hand, future refinements in the algorithm could lead to greater efficiencies. Because ϵ is a measure of privacy and not of data utility, there may be additional methodological refinements that could significantly improve utility for data users. For example, additional analyses in the December 2019 Committee on National Statistics Workshop on 2020 Data Products have shown that more recent vintages of DP demonstration data lead to very few effects on survey efficiency, but can lead to decreases in coverage for small populations (Brummet, 2019). Nonetheless, as DP methodologies continue to evolve, it will be important to conduct additional analyses of the kind described here, both to measure the impacts of DP on data privacy and utility and to illuminate the appropriate balance between these competing ends.


Disclosure Statement

We are grateful for support from the Alfred P. Sloan Foundation (award G-2019-12589).


References

Abowd, J., Ashmead, R., Garfinkel, S., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, A., & Sexton, W. (2019). Census topdown: Differentially private data, incremental schemas, and consistency with public knowledge. Working Paper. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0945_Consistency_for_Large_Scale_Differentially_Private_Histograms.pdf

Abowd, J. M., & Schmutte, I. M. (2019). An economic analysis of privacy protection and statistical accuracy as social choices. American Economic Review, 109(1), 171–202. https://doi.org/10.1257/aer.20170627

Ashmead, R., Kifer, D., Leclerc, P., Machanavajjhala, A., & Sexton, W. (2019). Effective privacy after adjusting for invariants with applications to the 2020 Census. Working Paper. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0941_Effective_Privacy_after_Adjusting_for_Constraints__With_applications_to_the_2020_Census.pdf

Barron, M., Davern, M., Montgomery, R., Tao, X., Wolter, K. M., Zeng, W., Dorell, C., & Black, C. (2015). Using auxiliary sample frame information for optimum sampling of rare populations. Journal of Official Statistics, 31(4), 545–557. https://doi.org/10.1515/jos-2015-0034

Biemer, P. P., & Lyberg, L. E. (2003). Introduction to survey quality (Vol. 335). John Wiley & Sons. https://doi.org/10.1002/0471458740

boyd, d. (2019). Differential privacy in the 2020 Census and the implications for available data products. arXiv. https://doi.org/10.48550/arXiv.1907.03639

Brummet, Q. (2019). The effect of differential privacy on survey operations. Paper presented at National Academies of Sciences, Engineering and Medicine, Committee on National Statistics Workshop on 2020 Census Data Products, Washington, DC, December 2019. https://www.nationalacademies.org/event/12-11-2019/workshop-on-2020-census-data-products-data-needs-and-privacy-considerations

Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley & Sons. https://www.wiley.com/en-us/Sampling+Techniques%2C+3rd+Edition-p-9780471162407

Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407. https://doi.org/10.1561/0400000042

Garfinkel, S. L., Abowd, J. M., & Powazek, S. (2018). Issues encountered deploying differential privacy. In Proceedings of the 2018 Workshop on Privacy in the Electronic Society (pp. 133–137). ACM. https://doi.org/10.1145/3267323.3268949

Kalton, G., & Anderson, D. W. (1986). Sampling rare populations. Journal of the Royal Statistical Society: Series A, 149(1), 65–82. https://doi.org/10.2307/2981886

Kalton, G. (2009). Methods for oversampling rare subpopulations in social surveys. Survey Methodology, 35(2), 125–141.

Kifer, D., & Machanavajjhala, A. (2014). Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems, 39(1), Article 3. https://doi.org/10.1145/2514689

Leclerc, P. (2019). Guide to the Census 2018 end-to-end test disclosure avoidance algorithm and implementation. Working Paper. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0938_2018_E2E_Test_Algorithm_Description.pdf

Li, C., Miklau, G., Hay, M., McGregor, A., & Rastogi, V. (2015). Optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6), 757–781. https://doi.org/10.1145/1807085.1807104

McClure, D., & Reiter, J. (2012). Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Transactions on Data Privacy, 5(3), 535–552. https://dl.acm.org/doi/10.5555/2423656.2423658

McKenna, L. (2018). Disclosure avoidance techniques used for the 1970 through 2010 Decennial Censuses of Population and Housing. Center for Economic Studies Working Paper 18–47, U.S. Census Bureau. https://www.census.gov/content/dam/Census/library/working-papers/2018/adrm/Disclosure%20Avoidance%20Techniques%20for%20the%201970-2010%20Censuses.pdf

National Research Council. (2003). Statistical issues in allocating funds by formula. National Academies Press. https://doi.org/10.17226/10580

Parsons, V. L., Moriarity, C., Jonas, K., Moore, T. F., Davis, K. E., & Tompkins, L. (2014). Design and estimation for the National Health Interview Survey, 2006–2015. Vital and Health Statistics Series, 2(165), 1–53. https://pubmed.ncbi.nlm.nih.gov/24775908/

Reiter, J. P. (2019). Differential privacy and federal data releases. Annual Review of Statistics and Its Application, 6, 85–101. https://doi.org/10.1146/annurev-statistics-030718-105142

Ruggles, S., Flood, S., Goeken, R., Grover, J., Meyer, E., Pacas, J., & Sobek, M. (2018). IPUMS USA: Version 8.0 Extract of 1940 Census for U.S. Census Bureau Disclosure Avoidance Research [data set]. IPUMS. https://doi.org/10.18128/D010.V8.0.EXT1940USCB

Ruggles, S., Fitch, C., Magnuson, D., & Schroeder, J. (2019). Differential privacy and census data: Implications for social and economic research. AEA Papers and Proceedings, 109, 403–408. https://doi.org/10.1257/pandp.20191107

Smith, T. W., Davern, M., Freese, J., & Morgan, S. L. (2019). General Social Surveys, 1972–2018: Cumulative codebook, Appendix A. NORC. National Data Program for the Social Sciences Series, no. 25. https://gss.norc.org/documents/codebook/gss_codebook.pdf

Spencer, B. (1980). Benefit-cost analysis of data used to allocate funds. Springer. https://doi.org/10.1007/978-1-4612-6099-8

Spencer, B. D., May, J., Kenyon, S., & Seeskin, Z. (2017). Cost-benefit analysis for a Quinquennial Census: The 2016 Population Census of South Africa. Journal of Official Statistics, 33(1), 1–26. https://doi.org/10.1515/jos-2017-0013

Tax Policy Center. (2020). Tax Policy Center briefing book: The state of state (and local) policy. https://www.taxpolicycenter.org/briefing-book/what-types-federal-grants-are-made-state-and-local-governments-and-how-do-they-work#:~:text=The%20federal%20government%20distributed%20about,of%20these%20governments'%20total%20revenues

Vadhan, S. (2017). The complexity of differential privacy. In Y. Lindell (Ed.), Tutorials on the foundations of cryptography (pp. 347–450). Springer. https://doi.org/10.1007/978-3-319-57048-8_7

Wolter, K. M., Polivka, A. E., & Lubich, A. (2015). Evolution of the Current Population Survey. In N. Balakrishnan, T. Colton, B. Everitt, W. Piegorsch, F. Ruggeri, & J. L. Teugels (Eds.), Wiley StatsRef: Statistics Reference Online. John Wiley & Sons. https://doi.org/10.1002/9781118445112.stat00063.pub2

Wood, A., Altman, M., Bembenek, A., Bun, M., Gaboardi, M., Honaker, J., Nissim, K., O’Brien, D. R., Steinke, T., & Vadhan, S. (2018). Differential privacy: A primer for a non-technical audience. Vanderbilt Journal of Entertainment & Technology Law, 21(1), 209–276. http://doi.org/10.2139/ssrn.3338027

Zaslavsky, A., & Schirm, A. (2002). Interactions between survey estimates and federal funding formulas. Journal of Official Statistics, 18, 371–391.


©2022 Quentin Brummet, Edward Mulrow, and Kirk Wolter. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
