Reflections on Brummet et al. and Asquith et al.
The Census Bureau has recently released a plan to ensure that all statistics released as part of the 2020 Census are protected by a noise injection algorithm that satisfies the definition of differential privacy. Using 1940 Census data that has been treated with a noise injection process similar to that proposed for the 2020 Census, we explore the utility of the 1940 data by analyzing how the additional noise might affect standard uses of decennial census data. We consider three separate uses of decennial census data: oversampling populations in surveys, screening operations for surveys of rare populations, and allocating federal funds to specific areas. We find that for use cases that involve large populations, the effects of noise injection are relatively modest. Nonetheless, the noise injection can lead to sampling-frame coverage issues for surveys of rare populations and to substantial misallocations of funds to local areas.
Keywords: differential privacy, decennial census, oversampling, funding allocations, probability sample, hard-to-reach population
In response to concerns that computational advances and ever-growing amounts of publicly retrievable data allow outside actors to ‘reidentify’ respondents in standard statistical products, the U.S. Census Bureau has announced that it will use differentially private (DP) noise injection techniques to protect respondent confidentiality in the 2020 Census. These new procedures are designed to provide strong privacy guarantees, and represent a significant departure from traditional statistical disclosure limitation techniques employed by the Census Bureau (Garfinkel et al., 2018; McKenna, 2018).
Because these techniques represent such a large change for the dissemination of decennial census data, they have been met with significant concern. Given that these methods will most likely lead to a loss in data utility, this concern is unsurprising. Nonetheless, how sizable this loss of utility will be is specific to the data set and privacy protection procedure that is used. Therefore, it is important to understand the potential effects of this new procedure for the 2020 Census. Decennial census statistics get used in a multitude of ways, and understanding the impact of DP in a variety of use cases is essential for informing the implementation of DP techniques.
We consider how some common uses of 2020 Census data might be impacted by DP. We do so by comparing uses of 1940 Census data with and without DP (when we began our work, the 1940 Census offered the only data set available on both a raw unadjusted and DP basis). We include three separate use cases, each considered under values of the privacy protection parameter, ϵ, ranging from 0.25 to 8.0 (smaller values provide greater privacy protection but less utility). While our analyses focus on the negative effects of additional noise on data utility, we acknowledge that any additional noise also serves to provide additional privacy to respondents. Combining analyses of utility and privacy effects may help to inform an optimal choice of ϵ.
Our first two use cases come from using census data in survey sampling. Many important government and private sector surveys, including the General Social Survey, Current Population Survey, and National Health Interview Survey, use census data for sampling and weighting (Parsons et al., 2014; Smith et al., 2019; Wolter et al., 2015). First, we consider a hypothetical survey that aims to oversample African Americans. We find that even with strong privacy protection (low values of ϵ), the oversampling process is only modestly less efficient than it would be if it were applied to raw unadjusted census data, and as ϵ increases the results are virtually identical with those arising from use of the unaltered census data. Second, we consider a hypothetical survey in which a brief screening interview is conducted to generate a sample survey of the adult American Indian and Alaskan Native (AIAN) population. We show that using noise-injected data creates inefficiencies in the survey design, with the degree of inefficiency positively associated with the level of privacy protection. Sponsors of surveys would have to pay for the inefficiencies either through increased survey budgets or through reduced sample size and, in turn, decreased precision. In addition to design inefficiency, we find some enumeration districts with nonzero AIAN population according to raw unadjusted census data have zero AIAN population according to DP protected census data. Consequently, given DP, whole segments of the AIAN population may not be subjected to sampling in the survey, which could inject a serious bias into the survey statistics.
Our third use case examines implications of DP for a hypothetical governmental program that distributes funds to geographic areas on the basis of the number of children in the area. Comparing the results of funding allocations corresponding to, respectively, DP protected and raw census data reveals that DP noise injection misallocates funds when ϵ is low and the misallocation decreases as ϵ increases.
There are a few caveats to note about our results. The 1940 Census data contain a relatively limited set of variables, which allowed us to analyze only a few important use cases. Also, the values of the privacy protection parameter, ϵ, applied to the 1940 Census are unlikely to reflect the same amount of noise as will be applied to the 2020 Census, and the direction of this effect is ambiguous. On one hand, improvements and refinements made to the Census Bureau’s DP algorithm will imply that the same ϵ will lead to less noise in 2020. On the other hand, if more statistics are released in 2020 than in the current 1940 data, the same global privacy budget, ϵ, will lead to more noise being added to the data in 2020 when compared to 1940. In addition, the geographies in 1940 do not reflect modern geographic boundaries. While enumeration districts are roughly similar in size to a modern census block group, they are drawn with different boundaries and do not have the same size distribution (we discuss these issues in more detail in Section 2.3). Finally, we note that our analysis only covers the effects of DP on data utility. The introduction of additional noise into the census data decreases data utility (as shown here), but comes with the benefit of providing stronger privacy protections to respondents. The charge of any data provider is to balance the important competing objectives of utility and privacy (Leclerc, 2019).
The rest of the article is structured as follows. Section 2 provides background for our analysis, Section 3 discusses the data sources used in the article, and Sections 4–6 present our methodology and results for each of our three use cases. Section 7 concludes and discusses avenues for future research.
As mentioned previously, increasing amounts of publicly available data and growing computational power have made it easier for attackers to make improper use of statistical data, such as identifying respondent information from aggregate anonymized statistics. Given that statistical agencies are legally obligated to ensure that data are used for statistical purposes only, this ‘reidentification’ of respondents is a serious and growing concern. In order to combat these challenges, DP adds a calibrated amount of noise to the data to make it less likely that respondent information can be accurately identified.
As described in Dwork and Roth (2014), the basic intuition behind ϵ-DP is that an attacker’s knowledge about the characteristics of an individual respondent should not improve markedly based on whether or not the individual’s information is included in the DP analysis. More formally, the output of the DP mechanism is almost equally likely to be generated from a neighboring data set as from the data set used to create the statistic. To define this precisely, consider a mechanism that takes as input a data set x and produces an estimate θ(x), a random variable conditioned on the values of the data. Examples could include a table package, a microdata set, or, in the current context, the full set of tables released as part of the decennial census. The mechanism is ϵ-DP if, for all events B in the range of θ and all data sets y such that x and y differ by one record:
$$\Pr[\theta(x) \in B] \le e^{\epsilon} \, \Pr[\theta(y) \in B]$$
The value of ϵ controls the amount of noise added, and is set by the data provider to obtain a balance between privacy and data utility (Abowd & Schmutte, 2019). Depending on the value of ϵ, the same DP algorithm can produce results that are either almost entirely noise or essentially identical to those that would be obtained from the original data.
Note also that this definition cannot be satisfied unless the function θ(.) introduces randomness into its output. The definition we present here assumes the raw unadjusted census data are fixed, and the only randomness introduced is by the privacy protection mechanism. This is not necessary for all definitions of DP, however. See Vadhan (2017) or Kifer and Machanavajjhala (2014) for further discussion and examples.
This definition of privacy provides strong privacy guarantees without making assumptions about the knowledge of potential data attackers. In addition, once protected with a DP method, data can be postprocessed in any manner and retain the same level of privacy protection. This allows the suite of methods satisfying this definition to provide flexible solutions for ensuring that data sets do not leak information about respondents. Therefore, these methods have a number of potential applications in federal statistical agencies, as discussed in Reiter (2019) and boyd (2019).
Nonetheless, many observers have raised concerns that this definition may lead to excessive amounts of noise being added to the census data. For example, Ruggles et al. (2019) suggest that the definition of DP is too large a departure from traditional disclosure avoidance methods and is not worth the cost of loss of utility in census data products. In addition, the privacy measure, ϵ, may or may not correspond to actual reidentification risk (McClure & Reiter, 2012). Finally, interpretation of privacy protection may be difficult for relatively large values of ϵ. For example, if ϵ = 0.10, this intuitively represents a roughly 10% increase in the probability of a bad event happening if a research subject chooses to share their data (Wood et al., 2018). However, if ϵ = 4, this relative increase would be e^4, or a roughly 55-fold increase in the probability of a bad event. If ϵ = 8, the same increase is e^8≈3,000.
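The arithmetic behind these interpretations can be checked directly. The short sketch below simply evaluates e^ϵ, the multiplicative bound in the ϵ-DP definition, for a few values of ϵ; the function name is ours and is used only for illustration.

```python
import math

def max_probability_ratio(epsilon: float) -> float:
    """Return e^epsilon, the largest factor by which the probability of any
    event may differ between two neighboring data sets under epsilon-DP."""
    return math.exp(epsilon)

# Values of epsilon discussed in the text, plus the endpoints used in our analysis.
for eps in (0.10, 0.25, 1.0, 4.0, 8.0):
    print(f"epsilon = {eps}: e^epsilon = {max_probability_ratio(eps):,.2f}")
```

For small ϵ the bound is close to 1 + ϵ, which is why ϵ = 0.10 reads as roughly a 10% increase, while the bound grows exponentially for larger values.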
A detailed summary of the 2020 Census DP algorithm similar to the one analyzed in this article is available in Abowd et al. (2019), Leclerc (2019), and Ashmead et al. (2019), but we summarize some important details here. Note that details of the algorithm have continued to evolve since the demonstration data used in this article were produced, and production DP census data could differ from the DP demonstration data used here.
At a high level, the algorithm works by dividing the data into a number of bins and constructing counts of individuals in each of these bins (e.g., the count of the population in a given census block*age group*race*ethnicity*gender combination). A random draw from a Laplace distribution is then added to each cell, and the statistics of interest are computed from these noise-injected data. This additional noise introduces uncertainty into the end statistics that makes it harder for potential attackers to reconstruct confidential data.
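To make the noise step concrete, the following is a minimal sketch of adding Laplace noise to a set of cell counts. It is an illustration under stated assumptions, not the Census Bureau's code: the cell labels are invented, and we assume a sensitivity of 1 (one person changes one cell count by one), so that the Laplace scale is 1/ϵ.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two i.i.d.
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_counts(counts: dict, epsilon: float,
                 sensitivity: float = 1.0, seed: int = 0) -> dict:
    """Add independent Laplace noise with scale sensitivity / epsilon
    to every cell count. The sensitivity of 1 is an assumption."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return {cell: n + laplace_noise(scale, rng) for cell, n in counts.items()}

# Hypothetical cells for a single enumeration district.
cells = {"White, 18+": 812, "Black, 18+": 143, "AIAN, 18+": 2}
print(noisy_counts(cells, epsilon=0.25))
print(noisy_counts(cells, epsilon=8.0))
```

With ϵ = 0.25 the typical absolute error per cell is about 4 persons, which is negligible for the large cell but larger than the true count of the smallest cell. The Census Bureau's algorithm, described in this section, is considerably more elaborate than this single noise step.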
This implementation of DP makes use of the matrix mechanism described in Li et al. (2015), but includes features that are unique to the decennial census context. First, the implementation respects certain ‘invariants,’ which are a group of statistics that are released without any noise injection. Current plans are for state-level population counts as well as housing unit totals and group quarters totals by group quarters type to be invariant. However, raw population counts at low levels of geography will not necessarily reflect true population totals.
In addition, this operationalization of DP is constructed with the Census Bureau’s traditional geographic hierarchy in mind. In other words, noise is added at different geographic levels so that statistics for more aggregated geographic areas contain less noise than statistics from less aggregated geographic areas. This will be important for the results that follow: any use of decennial census data for larger geographies will be much less affected by DP than uses that rely on precise information for smaller geographic levels.
The noise injection procedure also applies a number of ‘postprocessing’ steps to ensure that the DP data satisfy standard hierarchical relationships in decennial census data. These include constraints on the data set so that there can be no negative population counts for any geographies or subgroups, as well as ensuring that the data are constructed so that the sum of smaller cells will reproduce exactly the counts in larger cells.
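The two constraints just described, nonnegativity and additivity to the parent cell, can be illustrated with a deliberately simple stand-in. The sketch below clips negative noisy counts to zero, rescales to a known parent total, and rounds with a largest-remainder rule. This is our own simplification for exposition and should not be read as the Census Bureau's postprocessing method.

```python
import math

def postprocess(noisy: list, parent_total: int) -> list:
    """Turn noisy child counts into nonnegative integers that sum exactly
    to parent_total (simplified illustration only)."""
    clipped = [max(0.0, v) for v in noisy]          # no negative counts
    total = sum(clipped)
    if total == 0:
        # No usable signal: spread the parent total evenly.
        scaled = [parent_total / len(noisy)] * len(noisy)
    else:
        scaled = [v * parent_total / total for v in clipped]
    result = [math.floor(v) for v in scaled]
    shortfall = max(0, parent_total - sum(result))
    # Hand the remaining units to the cells with the largest fractional parts.
    by_remainder = sorted(range(len(scaled)),
                          key=lambda i: scaled[i] - result[i], reverse=True)
    for i in by_remainder[:shortfall]:
        result[i] += 1
    return result

print(postprocess([-1.3, 4.2, 10.9, 0.4], parent_total=15))
```

Even this toy version shows why postprocessing can distort small cells: a small true count that receives negative noise is set to zero, and its population is effectively reassigned to other cells.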
Finally, the amount of noise to be added in this procedure is governed by a ‘global ϵ.’ This global ϵ is then allocated among tables and geographies, with each individual table and geographic level receiving a share of this total budget. How this global ϵ is split is a potentially important policy decision for the Census Bureau.
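As a rough illustration of why this split matters, the sketch below divides a global ϵ across geographic levels and reports the Laplace scale each level would face. The equal shares, the level names, and the sensitivity of 1 are hypothetical choices made for the example; the actual allocation is a policy decision and is not reproduced here.

```python
def allocate_budget(global_epsilon: float, shares: dict) -> dict:
    """Split a global epsilon across levels; shares must sum to 1 so that
    the per-level budgets add back up to the global budget."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {level: global_epsilon * share for level, share in shares.items()}

def laplace_scale(epsilon: float, sensitivity: float = 1.0) -> float:
    """Scale of the Laplace noise implied by a given budget (assumed sensitivity 1)."""
    return sensitivity / epsilon

# Hypothetical equal split across the four 1940 geographic levels.
shares = {"nation": 0.25, "state": 0.25, "county": 0.25, "enumeration district": 0.25}
for level, eps in allocate_budget(1.0, shares).items():
    print(f"{level}: epsilon = {eps}, Laplace scale = {laplace_scale(eps)}")
```

Under an equal split every level receives noise of the same absolute size, so the relative error is much smaller for a state than for an enumeration district. Shifting budget toward one level reduces its noise at the expense of the others.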
To enable testing of the effects of DP on utility, the Census Bureau released a DP version of the 1940 Census. The noise-injected files were created using essentially the same procedure outlined in Section 2.2, except for a couple of important changes that influence the conclusions of the analysis. First, the Census Bureau only released three tables of data. In practice, this means that the data are constructed by forming cross-tabulations of group quarters, voting age, race, and ethnicity and then adding noise to these cross-tabulations. While the 1940 DP procedures take similar postprocessing steps to the 2020 DP algorithm in order to account for factors such as invariants and structural zeroes, there are nonetheless a limited number of variables available in these noise-injected files relative to what will be included in the full production of the 2020 Decennial Census.
Second, because census geographic definitions have evolved over time, the geographies of the 1940 Census data are different from current census geographies. While DP algorithms on modern decennial census data will inject noise along the traditional ‘spine’ of census geographies (block, block group, tract, tract group [added to facilitate the 2020 algorithm], county, state, nation), the 1940 Census data only contain enumeration district, county, state, and nation. While the median enumeration district is similar in size to a modern block group (typically between 600 and 3,000 population), they are not necessarily comparable geographic units. We do our best throughout the analysis to explore the sensitivity of our results to geographies of varying sizes, but future research with more modern data should better address this issue.
Finally, in addition to a relatively limited set of variables and a different geographic hierarchy, it is quite likely that the values of ϵ for 1940 Census data will not lead to the same amount of noise in the data as they will in the 2020 Census. As discussed previously, any changes to the underlying algorithm or to the amount of statistics produced will impact the amount of noise that is effectively added to specific statistics to guarantee the same level of privacy.
We use two separate data sources for our analysis. First, we consider the 1940 Census full count file made available by the Minnesota Population Center (Ruggles et al., 2018). This file contains all the variables present in the 1940 Census for the entire U.S. population, but we restrict our attention to a set of variables that align with those that are present in the noise-injected data:
Geography (state, county, enumeration district)
Group quarters status
Race (White, Black, AIAN, Chinese, Japanese, other Asian/Pacific Islander)
Hispanic ethnicity
Indicator variable for whether the individual is 18 years or older (i.e., an indicator of voting age)
Our noise-injected files were made available by the Census Bureau using the 1940 Census noise injection algorithm described here and are available at https://www2.census.gov/census_1940/. These files are full synthetic population files that are constructed to generate the tables noted in Section 2.3. To permit a range of tests, these files were produced for eight different values of ϵ: 0.25, 0.50, 0.75, 1, 2, 4, 6, and 8. In addition, because the noise injection process includes a random component, the Census Bureau released four separate runs of the file for each value of ϵ. For our use-cases, we present analyses across each of these runs in order to illuminate the variability that can be expected in census tabulations as a result of random draws within the same noise injection process.
Prior to diving into specific use cases, we first provide a descriptive comparison of the noise-injected and original data. As a first pass, we construct cells of race*voting age counts for every enumeration district. To simplify matters, we collapse the racial/ethnic categories into five smaller categories: White non-Hispanic, Black non-Hispanic, Asian non-Hispanic, AIAN non-Hispanic, and Hispanic. This results in population counts for roughly 2.7 million cells of data. In order to provide a sense of the magnitude of noise that is injected into the data with DP, Figure 1 shows, for any non-empty cell in the true data, the percent difference between the DP data and the true data for each of these race*voting age*enumeration district cells across cell size and the value of ϵ. The first row shows the distribution for ϵ = 0.25. For cells of less than 100 or 100–999 individuals, there are quite thick tails in the distribution and many counts are off by more than 35–50%. Indeed, even for large cells of greater than 1,000 individuals, a sizeable number of cell counts are off by at least 10%. As ϵ increases, the distribution tightens around zero. For ϵ = 1, the distribution for larger cells of at least 1,000 individuals is quite compact. For ϵ = 8, the distribution is relatively tight around zero for all values, especially as the true cell count exceeds 100 individuals.
While Figure 1 focuses only on cells that are non-empty in the true data, Figure 2 considers the related question of how many cells contain 0 counts in the true data but have nonzero counts in the noise-injected data. Panel A shows the proportion of empty cells in the true data that have nonzero counts in the noise-injected data by ϵ. Interestingly, this number increases as ϵ increases, but nonetheless stays relatively low at less than 5% for all ϵ. Panel B shows that the average noise-injected population count for these nonzero count cells is small. Therefore, for most of these cells for which there are zero population counts in the true data, the population counts in the noise-injected data are either zero or very small across all values of ϵ.
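The two cell-level comparisons behind Figures 1 and 2 are simple to state in code. The sketch below is illustrative only: the function names and the toy counts are ours, and the actual figures are built from the full set of race*voting age*enumeration district cells.

```python
def percent_difference(true_count: float, dp_count: float) -> float:
    """Percent difference between DP and true counts for a non-empty true cell."""
    return 100.0 * (dp_count - true_count) / true_count

def share_false_nonzero(true_counts: list, dp_counts: list) -> float:
    """Among cells that are empty in the true data, the share that have a
    nonzero count in the noise-injected data."""
    dp_for_empty = [d for t, d in zip(true_counts, dp_counts) if t == 0]
    return sum(1 for d in dp_for_empty if d > 0) / len(dp_for_empty)

# Toy example with five cells.
true_cells = [0, 0, 0, 0, 12]
dp_cells = [0, 1, 0, 0, 10]
print(percent_difference(true_cells[4], dp_cells[4]))   # one non-empty cell
print(share_false_nonzero(true_cells, dp_cells))
```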
To provide an overview of how these cell-level population count differences translate to differences at the enumeration district level, Table 1 provides a set of correlation coefficients between total nongroup quarters population counts in the noise-injected files and the unaltered data. Each row presents a set of correlations for a given ϵ and a given run of the noise injection algorithm. The results are very similar to those presented in Figure 1. Overall, the correlation between enumeration district population counts in the DP and the true data is relatively high—at least 0.98 for all ϵ. Nonetheless, when examining enumeration districts with fewer than 100 people, the noise from DP almost entirely obscures the association between the population sizes; the correlation coefficients are below 0.1 for all ϵ.
ϵ | Run | All | < 100 True Population | 100–999 True Population | At least 1,000 True Population
0.25 | 1 | 0.980 | 0.042 | 0.886 | 0.981
0.25 | 2 | 0.980 | 0.043 | 0.885 | 0.981
0.25 | 3 | 0.980 | 0.043 | 0.885 | 0.981
0.25 | 4 | 0.980 | 0.043 | 0.884 | 0.981
0.50 | 1 | 0.985 | 0.053 | 0.918 | 0.988
0.50 | 2 | 0.985 | 0.051 | 0.918 | 0.988
0.50 | 3 | 0.985 | 0.053 | 0.918 | 0.988
0.50 | 4 | 0.985 | 0.051 | 0.917 | 0.988
0.75 | 1 | 0.987 | 0.058 | 0.925 | 0.990
0.75 | 2 | 0.987 | 0.058 | 0.925 | 0.990
0.75 | 3 | 0.987 | 0.058 | 0.924 | 0.990
0.75 | 4 | 0.987 | 0.059 | 0.925 | 0.990
1 | 1 | 0.987 | 0.062 | 0.927 | 0.990
1 | 2 | 0.987 | 0.062 | 0.927 | 0.990
1 | 3 | 0.987 | 0.061 | 0.927 | 0.990
1 | 4 | 0.987 | 0.063 | 0.927 | 0.990
2 | 1 | 0.987 | 0.068 | 0.930 | 0.991
2 | 2 | 0.987 | 0.068 | 0.930 | 0.991
2 | 3 | 0.987 | 0.068 | 0.930 | 0.991
2 | 4 | 0.987 | 0.069 | 0.930 | 0.991
4 | 1 | 0.987 | 0.071 | 0.930 | 0.991
4 | 2 | 0.987 | 0.071 | 0.930 | 0.991
4 | 3 | 0.987 | 0.071 | 0.930 | 0.991
4 | 4 | 0.987 | 0.071 | 0.930 | 0.991
6 | 1 | 0.987 | 0.072 | 0.930 | 0.991
6 | 2 | 0.987 | 0.072 | 0.930 | 0.991
6 | 3 | 0.987 | 0.072 | 0.930 | 0.991
6 | 4 | 0.987 | 0.072 | 0.930 | 0.991
8 | 1 | 0.987 | 0.072 | 0.930 | 0.991
8 | 2 | 0.987 | 0.072 | 0.930 | 0.991
8 | 3 | 0.987 | 0.072 | 0.930 | 0.991
8 | 4 | 0.987 | 0.072 | 0.930 | 0.991
N Enumeration Districts | | 134,017 | 9,182 | 69,814 | 55,021
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. All statistics are for the nongroup quarters population.
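A sketch of how correlations of this kind can be computed is given below. It groups enumeration districts by their true population using the same three size classes as Table 1 and computes a Pearson correlation within each group. The toy population values are invented, and the helper names are ours.

```python
import math

def pearson(x: list, y: list) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def size_class(true_pop: float) -> str:
    """Size classes defined on the TRUE population, as in Table 1."""
    if true_pop < 100:
        return "<100"
    if true_pop < 1000:
        return "100-999"
    return ">=1000"

def correlations_by_size(true_counts: list, dp_counts: list) -> dict:
    """Correlation between true and DP district populations, overall and by class."""
    groups = {"all": ([], [])}
    for t, d in zip(true_counts, dp_counts):
        for key in ("all", size_class(t)):
            xs, ys = groups.setdefault(key, ([], []))
            xs.append(t)
            ys.append(d)
    return {k: pearson(xs, ys) for k, (xs, ys) in groups.items() if len(xs) > 1}

# Invented district populations: noise of a few dozen persons in every district.
true_pop = [10, 50, 90, 200, 500, 900, 1500, 3000]
dp_pop = [40, 20, 95, 230, 480, 910, 1490, 3020]
print(correlations_by_size(true_pop, dp_pop))
```

Because the size classes condition on a narrow range of true population, noise of a fixed absolute size accounts for a much larger share of the variation within the smallest class, which is consistent with the low correlations in that column.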
While raw population counts are important statistics, decennial census data are also used extensively for tracking how communities have grown over time. To assess how DP might affect growth rates, Table 2 presents distributional statistics of differences in the growth rate in county population between 1930 and 1940. These are constructed using the true county population in 1930 and comparing it to either the true population in 1940 or the noise-injected population count in 1940. The results show that growth rates are particularly sensitive to the choice of ϵ and that DP noise injection can create extreme differences between true and noise-injected values. For example, for ϵ = 0.25, 5% of counties experience growth rates that are almost 50 percentage points less than their true growth rate. Again, as ϵ approaches 8, these differences become relatively minimal. Nonetheless, since a given global ϵ is likely to produce more noise in the 2020 Census than in the 1940 Census, these results still raise concerns about the effects of DP noise on the accuracy of county growth statistics.
 | | Percentage Change Distribution | | | |
ϵ | N | 5th Percentile | 25th Percentile | Median | 75th Percentile | 95th Percentile
0.25 | 3096 | -47.55% | -5.85% | -0.38% | 5.48% | 40.59%
0.50 | 3096 | -21.80% | -2.71% | -0.01% | 3.51% | 25.61%
0.75 | 3096 | -16.19% | -1.97% | -0.09% | 2.04% | 16.54%
1 | 3096 | -12.08% | -1.52% | -0.02% | 1.67% | 13.04%
2 | 3096 | -6.16% | -0.72% | 0.00% | 0.99% | 7.45%
4 | 3096 | -3.39% | -0.42% | 0.00% | 0.50% | 3.94%
6 | 3096 | -2.61% | -0.30% | 0.00% | 0.33% | 2.37%
8 | 3096 | -1.60% | -0.21% | 0.00% | 0.26% | 2.14%
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. The unit of observation is the county. Cells show the distribution of difference in percentage population growth from 1930 to 1940 between DP and true data (DP change–true change). Percentiles represent the percentiles of the distribution of percentage change. Population growth is calculated using true county population in 1930 Census and either DP or true population from 1940 Census data.
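The quantity summarized in Table 2 can be written out as follows. The sketch computes, for one county, the DP growth rate minus the true growth rate in percentage points, and then summarizes a list of such gaps at the percentiles reported in the table. The county values are invented, and the percentile method (the default of Python's `statistics.quantiles`) is an assumption that may differ from the one used for the table.

```python
import statistics

def growth_rate_gap(pop_1930: float, true_1940: float, dp_1940: float) -> float:
    """DP growth minus true growth from 1930 to 1940, in percentage points.
    Assumes pop_1930 > 0; the 1930 population is the true count in both terms."""
    true_growth = (true_1940 - pop_1930) / pop_1930
    dp_growth = (dp_1940 - pop_1930) / pop_1930
    return 100.0 * (dp_growth - true_growth)

def summarize(gaps: list) -> list:
    """Return the 5th, 25th, 50th, 75th, and 95th percentiles of the gaps."""
    cuts = statistics.quantiles(gaps, n=20)      # 19 cut points at 5%, 10%, ..., 95%
    return [cuts[i] for i in (0, 4, 9, 14, 18)]

# Invented county: 1,000 people in 1930, 1,100 in 1940, DP count of 1,050.
print(growth_rate_gap(1000, 1100, 1050))
```

Since the 1930 base is the same in both terms, the gap reduces to the 1940 noise divided by the 1930 population, so small counties are the ones that populate the tails of the distribution.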
To determine how differences in descriptive statistics might affect other real-world uses of decennial census data, we now run through three separate use cases and analyze how DP data would affect the results. For each use case, we discuss the methodology used, the results, and any specific caveats to keep in mind when attempting to extrapolate the results of this analysis to the 2020 Census setting.
To produce accurate statistics for both the total population and subgroups of the population, surveys are often designed to oversample the subgroup(s) of analytical interest (Barron et al., 2015; Biemer & Lyberg, 2003; Cochran, 1977; Kalton, 2009; Kalton & Anderson, 1986). Given that oversampling relies on targeting these populations using decennial census data, DP noise injection could potentially make the process of oversampling less efficient. Specifically, the more noise that is added to the data, the less efficient any oversampling strategy will be. In order to estimate how DP will affect this process, we introduce a potential oversampling procedure and then show how the results of this procedure will differ when using noise-injected data. While additional methodologies such as moving to a panel survey, sampling frame enhancements, and expanding use of administrative records may help to ameliorate some of the negative effects shown here, the exact methodological details of these methods are unclear and they would come with their own start-up costs. Therefore, we consider them outside the scope of the current analysis.
We consider the specific example of creating an oversample of African Americans in an existing sample survey. First, we construct a sampling frame of specific geographies from across the country to mimic a standard survey frame. In particular, we select 150 geographic areas of more than 500,000 individuals to comprise our primary sampling units. From each geographic area, we draw four enumeration districts for rural areas and eight enumeration districts for urban areas. This yields a survey frame that includes 800 enumeration districts in total. This process creates geographic clustering in the sample, which helps to limit the costs of in-person interviewing. Note that while not shown here, results using the entire nation as a sampling frame are qualitatively similar.
Enumeration districts are classified into two domains based on whether the concentration of African Americans is above a threshold. We set this threshold at 30%, but our results are robust to modest changes in this value. Denote these domains as H and L, where the total number of individuals in H and L are denoted by
There are two stages to sampling. In the first stage, enumeration districts are selected with probability proportional to the size of the district population. Enumeration districts in H are oversampled, where the oversampling is governed by a parameter b such that the probability that any single district is selected is b times greater for enumeration districts in H than for districts in L. In the second stage of sampling, a fixed number of individuals are randomly selected from each of the enumeration districts selected in the first stage. For concreteness, suppose we design a sample of 1,000 individuals. The fractions of individuals selected for the survey are denoted as follows:
where b is a parameter controlling the degree of oversampling. As b→∞, only individuals in the high-density domain are sampled. In the primary results below, we use b = 2, but results are qualitatively similar for other common values of b.
The expected number of African Americans in the resulting 1,000-person sample will be
From our hypothetical sampling frame, we select a sample from the noise-injected data using the methodology described. We then analyze how many individuals in our sample would be African American using the true data, and how many households would need to be sampled in order to ensure that 1,000 African Americans are in the sample. Throughout, we assume that 100% of individuals contacted would complete the survey. Provided that response rates are constant across groups, survey nonresponse should affect sampling using both DP and true data equally.
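One way to read this procedure is sketched below. Because the sampling formulas are not reproduced here, the sketch rests on our own simplifying assumptions: each district's expected contribution to the sample is proportional to its frame population times its oversampling factor (b for districts classified as high density by the frame, 1 otherwise), and the composition of each district's respondents follows the true data. The `District` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class District:
    frame_total: float   # district population according to the frame (true or DP)
    frame_group: float   # subgroup count according to the frame
    true_total: int      # true district population
    true_group: int      # true subgroup count

def expected_group_share(districts: list, threshold: float = 0.30,
                         b: float = 2.0) -> float:
    """Expected subgroup share of the sample when districts are classified
    and sized with frame data but respondents reflect the true data."""
    numerator = denominator = 0.0
    for d in districts:
        if d.frame_total <= 0 or d.true_total <= 0:
            continue                      # district cannot be sampled or is empty
        high_density = d.frame_group / d.frame_total >= threshold
        weight = (b if high_density else 1.0) * d.frame_total
        numerator += weight * d.true_group / d.true_total
        denominator += weight
    return numerator / denominator

# Two invented districts: one at 40% and one at 10% subgroup concentration.
frame_is_truth = [District(1000, 400, 1000, 400), District(1000, 100, 1000, 100)]
print(expected_group_share(frame_is_truth))          # frame equals the true data
# Same districts, but noise pushes the first one below the 30% threshold.
noisy_frame = [District(1000, 290, 1000, 400), District(1000, 100, 1000, 100)]
print(expected_group_share(noisy_frame))
```

The second call illustrates the mechanism by which noise reduces efficiency: a district that is misclassified into the low-density domain loses its oversampling factor, and the expected subgroup share of the sample falls.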
Table 3a provides statistics on the sample representation using this process. In the first row, we see that with the unaltered data, 20.5% of the sample will be African American. With relatively small ϵ, this number drops slightly to 20.1% for run 1. This is a small change though, and as ϵ increases, the percentage of African Americans in the sample based on DP data appears very similar to what would be obtained from a survey using the unaltered data. Note also that the variation among runs is relatively minor. Even for the most extreme difference when ϵ = 0.25, there is only a 0.26 percentage-point difference in the composition of the survey sample.
 | Run 1 | Run 2 | Run 3 | Run 4
True Data | 20.53% | | |
Noise-Injected Data | | | |
ϵ = 0.25 | 20.12% | 20.14% | 20.38% | 20.14%
ϵ = 0.50 | 20.31% | 20.41% | 20.32% | 20.37%
ϵ = 0.75 | 20.46% | 20.43% | 20.33% | 20.42%
ϵ = 1 | 20.42% | 20.50% | 20.41% | 20.44%
ϵ = 2 | 20.50% | 20.49% | 20.47% | 20.48%
ϵ = 4 | 20.51% | 20.49% | 20.49% | 20.47%
ϵ = 6 | 20.50% | 20.51% | 20.50% | 20.50%
ϵ = 8 | 20.50% | 20.50% | 20.50% | 20.50%
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the fraction of the sample that will be African American. Sample is drawn by sampling enumeration districts with probability proportional to size, and sampling enumeration districts that are at least 30% African American at twice the rate. Of enumeration districts in the sample, 147 of 800 are in the high sampling rate stratum. Response rates across groups assumed to be constant at 100%.
As another way of presenting this information, Table 3b shows the number of completed surveys that would be needed in order to obtain 1,000 African American respondents. This shows the same pattern: there are very small differences for low ϵ, but as ϵ increases the results using the DP data look very similar to those using the true unaltered data. While not shown here, results using an oversampling parameter anywhere in the range of 5–10 are qualitatively similar. Therefore, for a relatively large subgroup such as African Americans, the projected decrease in sampling efficiency of oversampling is likely to be small.
 | Run 1 | Run 2 | Run 3 | Run 4
True Data | 4870 | | |
Noise-Injected Data | | | |
ϵ = 0.25 | 4969 | 4966 | 4907 | 4965
ϵ = 0.50 | 4924 | 4900 | 4922 | 4910
ϵ = 0.75 | 4886 | 4895 | 4919 | 4898
ϵ = 1 | 4897 | 4878 | 4901 | 4894
ϵ = 2 | 4879 | 4882 | 4884 | 4883
ϵ = 4 | 4876 | 4880 | 4879 | 4885
ϵ = 6 | 4879 | 4877 | 4878 | 4878
ϵ = 8 | 4878 | 4878 | 4878 | 4877
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of completed surveys needed to obtain 1,000 African American respondents. Sample is drawn by sampling enumeration districts with probability proportional to size, and sampling enumeration districts that are at least 30% African American at twice the rate. Of enumeration districts in the sample, 147 of 800 are in the high sampling rate stratum. Response rates across groups assumed to be constant at 100%.
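The link between the two tables is a one-line calculation: under the 100% response assumption, the number of completed surveys needed is the target number of subgroup respondents divided by the expected subgroup share of the sample. The sketch below applies this to the rounded shares from Table 3a; the function name is ours, and small differences from Table 3b are to be expected because the published shares are rounded to two decimal places.

```python
def completes_needed(target: int, group_share: float) -> float:
    """Completed surveys needed, in expectation, to reach `target` subgroup
    respondents when a fraction `group_share` of the sample is in the subgroup."""
    return target / group_share

# Rounded shares taken from Table 3a: true data and epsilon = 0.25, run 1.
for label, share in (("true data", 0.2053), ("epsilon = 0.25, run 1", 0.2012)):
    print(f"{label}: about {completes_needed(1000, share):,.0f} completes")
```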
If a subgroup is rare enough, oversampling the group as part of a general population survey may be cost prohibitive. A more cost-effective alternative may be to conduct a survey that screens individuals in order to create a sample of only individuals from the rare subgroup. To make this screening operation as efficient as possible, a sample might be designed with a heavy oversample of areas with high concentrations of the rare subgroup, and, as with the oversampling case shown, DP noise injection would lead to a loss of efficiency in this process. Below, we discuss the mechanisms of how this process works and then provide evidence on the potential effects of DP noise injection in the case of surveying the adult AIAN population.
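Consider the population of adult AIAN individuals. As in our first use case, enumeration districts are divided into a high-density stratum and a low-density stratum according to the fraction of their adult population that is AIAN, and the high-density stratum is sampled at b times the rate of the low-density stratum. If we wish to obtain a sample of 1,000 AIAN adults, then the number of adults we need to screen equals 1,000 divided by the expected fraction of screened adults who are AIAN, where this fraction depends on the strata definitions, the density of AIAN adults in each stratum, and the oversampling parameter b.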
With DP data, both the strata definitions and the density of AIAN adults in each stratum will be altered from their true values. Therefore, the noise injected into the data by DP could lead to decreases in efficiency. In addition, the results will show a loss in coverage for certain segments of the population if the noise induced by DP leads enumeration districts that contain AIAN adults to appear to have no AIAN adult population.
Note also that the expected number of completes needed to obtain a sample of a given size is not comparable to that for the African American oversampling result in Section 4. The concentrations of these populations differ, and we use an oversampling parameter (b) of 2 in Section 4 but an oversampling parameter of 5 here in Section 5.
Using all enumeration districts in the United States in 1940 as the sampling frame, we calculate the resulting fraction of the sample that will be adult AIAN individuals and use this to estimate how many screeners will be needed to achieve a sample of 1,000 adult AIAN individuals. As before, we assume that there is perfect response given that any survey nonresponse will affect the operations using DP data or true data equally.
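To make the calculation concrete, the following minimal sketch computes the expected number of screeners under one simple two-stratum design. It is an illustration under stated assumptions, not the code used for the analysis: districts are selected with probability proportional to their reported AIAN adult count, inflated by the oversampling parameter b when the reported AIAN share meets the threshold, and an equal number of adults is screened in every selected district. The `District` record, the function names, and the toy numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class District:
    true_adults: int      # true adult population
    true_aian: int        # true AIAN adult population
    reported_adults: int  # adult population in the data used for the design
    reported_aian: int    # AIAN adults in the data used for the design


def expected_aian_fraction(districts, b=5.0, threshold=0.30):
    """Expected share of screened adults who are AIAN (assumed design)."""
    weights = []
    for d in districts:
        if d.reported_aian <= 0 or d.reported_adults <= 0:
            weights.append(0.0)  # district drops out of the sampling frame
            continue
        share = d.reported_aian / d.reported_adults
        rate = b if share >= threshold else 1.0
        weights.append(rate * d.reported_aian)
    total = sum(weights)
    if total == 0:
        raise ValueError("no district has a positive measure of size")
    # Selection uses the reported data; the yield uses the TRUE density.
    return sum(w / total * d.true_aian / d.true_adults
               for w, d in zip(weights, districts) if d.true_adults > 0)


def screeners_needed(districts, target=1000, b=5.0, threshold=0.30):
    """Expected number of screeners to reach `target` AIAN adults."""
    return target / expected_aian_fraction(districts, b, threshold)


if __name__ == "__main__":
    # Toy frame: the design data equal the truth in the first case only.
    truth = [District(1000, 500, 1000, 500),
             District(1000, 100, 1000, 100),
             District(1000, 10, 1000, 10)]
    noisy = [District(1000, 500, 1000, 250),   # falls below the threshold
             District(1000, 100, 1000, 150),
             District(1000, 10, 1000, 0)]      # appears to have no AIAN adults
    print(round(screeners_needed(truth)), round(screeners_needed(noisy)))
```

In the toy example, the noisy design misclassifies the densest district and drops the sparsest one from the frame, so more screeners are needed even though the true population is unchanged.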
Table 4 presents the results of this exercise. Each cell contains the number of screeners needed in expectation to achieve a sample of 1,000 adult AIAN individuals for a specific value of ϵ and run of the DP noise algorithm. Given how rare this population is, it would take significant effort to achieve this sample: even with the true data, it would require 2,691 screeners. The results using the DP data show modest increases in the number of screeners for low values of ϵ. For example, when ϵ = 0.25, 3,172–3,297 screeners are needed to obtain the 1,000-individual sample. As ϵ increases, the results approach those using the true data. For any given value of ϵ, the results are relatively stable across runs of the DP algorithm.
Data | Run 1 | Run 2 | Run 3 | Run 4 |
True Data | 2691 | | | |
Noise-Injected Data | | | | |
ϵ = 0.25 | 3172 | 3178 | 3297 | 3180 |
ϵ = 0.50 | 2896 | 2914 | 2908 | 2911 |
ϵ = 0.75 | 2817 | 2812 | 2826 | 2830 |
ϵ = 1 | 2788 | 2786 | 2785 | 2792 |
ϵ = 2 | 2745 | 2741 | 2742 | 2743 |
ϵ = 4 | 2725 | 2718 | 2722 | 2722 |
ϵ = 6 | 2719 | 2717 | 2716 | 2716 |
ϵ = 8 | 2714 | 2712 | 2712 | 2714 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of screeners needed to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to the size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate.
To translate these results to costs, we need to make an assumption about the cost of a screener interview relative to a full survey interview. Given that relative costs could vary from survey to survey, we analyze a variety of cost ratios. We simplify slightly by abstracting from the fixed costs of running a survey operation and assuming that all costs are either due to screener interviews or survey interviews. Incorporating fixed costs would lead to smaller impacts of DP noise on survey costs.
Table 5 shows the relative increase in costs under the assumptions that a full survey interview costs the same as, twice as much as, 5 times as much as, or 10 times as much as a screener interview. Clearly, as the cost of a full interview relative to a screener interview increases, the effect of DP noise injection on costs decreases. For ϵ = 0.25, the increase in costs depends strongly on the assumptions made about interview cost structures. When screener interviews are relatively expensive (cost ratios of 1 or 2), survey costs increase by more than 10%. As the screeners become relatively less expensive, the increase in costs is more modest. For values of ϵ above 1, the increase in costs is modest regardless of the assumption about cost structure. In theory, these results could differ with both the degree of oversampling and the choice of where to divide the strata. While not reported here, our results are robust to changes in both the oversampling parameter and the choice of where to divide the enumeration districts into high- and low-density strata.
Cost of Interview Relative to Screener
ϵ | 1x (same cost) | 2x | 5x | 10x |
ϵ = 0.25 | 13.02% | 10.25% | 6.25% | 3.79% |
ϵ = 0.50 | 5.56% | 4.37% | 2.67% | 1.62% |
ϵ = 0.75 | 3.42% | 2.69% | 1.64% | 0.99% |
ϵ = 1 | 2.63% | 2.07% | 1.26% | 0.77% |
ϵ = 2 | 1.45% | 1.14% | 0.70% | 0.42% |
ϵ = 4 | 0.92% | 0.72% | 0.44% | 0.27% |
ϵ = 6 | 0.75% | 0.59% | 0.36% | 0.22% |
ϵ = 8 | 0.62% | 0.49% | 0.30% | 0.18% |
Note. From 1940 Census demonstration DP data and IPUMS-USA 1940 full count data. Cells contain the percent increase in cost to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to the size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate. Columns differentiate the cost of a full interview relative to a theoretically shorter screener interview.
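The entries in Table 5 can be approximately reproduced from the run 1 screener counts in Table 4 under a simple cost model, sketched below. The model is an assumption inferred from the text rather than a documented formula: each screener costs one unit, each of the 1,000 completed interviews costs `ratio` units, and fixed costs are ignored. The function names are illustrative.

```python
TARGET_INTERVIEWS = 1000
TRUE_SCREENERS = 2691
# Run 1 screener counts from Table 4, keyed by epsilon.
DP_SCREENERS_RUN1 = {0.25: 3172, 0.50: 2896, 0.75: 2817, 1: 2788,
                     2: 2745, 4: 2725, 6: 2719, 8: 2714}


def total_cost(screeners, ratio, interviews=TARGET_INTERVIEWS):
    """Cost in screener units: one unit per screener, `ratio` per interview."""
    return screeners + ratio * interviews


def pct_cost_increase(dp_screeners, ratio, true_screeners=TRUE_SCREENERS):
    """Percent increase in total cost when the design uses DP data."""
    base = total_cost(true_screeners, ratio)
    return 100.0 * (total_cost(dp_screeners, ratio) - base) / base


if __name__ == "__main__":
    for eps, screeners in DP_SCREENERS_RUN1.items():
        row = [f"{pct_cost_increase(screeners, r):.2f}%" for r in (1, 2, 5, 10)]
        print(f"eps = {eps}: " + " | ".join(row))
```

Under these assumptions, the ϵ = 0.25 row comes out at roughly 13.0%, 10.3%, 6.3%, and 3.8%, which agrees with Table 5 up to rounding of the underlying screener counts.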
Another way to visualize this phenomenon is to examine the relationship between true and DP percent AIAN across enumeration districts. Figure 3 presents scatterplots of the relationship between true percent AIAN and the percent AIAN in the noise-injected data, with each dot weighted by the size of the enumeration district. We can see that with large ϵ, the dots become clustered around the 45-degree line, representing much less noise being added to the data. However, for small values of ϵ there is significant dispersion around the 45-degree line and many enumeration districts experience large percentage changes in their AIAN population due to the addition of noise.
In order to assess whether these results are driven by enumeration districts with very small populations, Tables 6a and 6b present the same results as Table 4 but restricted to districts of at least 100 and at least 1,000 total population, respectively. Specifically, for both the true and noise-injected data, we consider sampling only enumeration districts of a given minimum size. While the results in Table 6a for enumeration districts of at least 100 population are very similar to those in Table 4, the results in Table 6b show that when restricting to enumeration districts of at least 1,000 population, the screening operation is about equally efficient for true and noise-injected data. Therefore, if only geographic areas of at least 1,000 people were available for sampling in these data, the noise injection would have little influence. This may be important in many modern settings, where slightly larger geographic units such as census tracts are often used for sampling purposes. However, if survey designers need to access data for smaller geographic areas, then there likely would be a loss in efficiency.
Data | Run 1 | Run 2 | Run 3 | Run 4 |
True Data | 2712 | | | |
Noise-Injected Data | | | | |
ϵ = 0.25 | 3065 | 3094 | 3238 | 3107 |
ϵ = 0.50 | 2847 | 2846 | 2849 | 2835 |
ϵ = 0.75 | 2782 | 2768 | 2783 | 2797 |
ϵ = 1 | 2762 | 2766 | 2756 | 2765 |
ϵ = 2 | 2739 | 2738 | 2736 | 2738 |
ϵ = 4 | 2733 | 2727 | 2730 | 2733 |
ϵ = 6 | 2733 | 2728 | 2731 | 2728 |
ϵ = 8 | 2729 | 2727 | 2727 | 2730 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of screeners needed to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to the size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate. Results for both the true data and the DP data are restricted to enumeration districts of at least 100 population, which represent 93.1% of all enumeration districts.
Data | Run 1 | Run 2 | Run 3 | Run 4 |
True Data | 3011 | | | |
Noise-Injected Data | | | | |
ϵ = 0.25 | 3016 | 3085 | 3160 | 3116 |
ϵ = 0.50 | 3027 | 3021 | 2980 | 3065 |
ϵ = 0.75 | 3007 | 3008 | 3035 | 3003 |
ϵ = 1 | 2960 | 3014 | 3030 | 3015 |
ϵ = 2 | 3041 | 3016 | 2997 | 3023 |
ϵ = 4 | 3014 | 3020 | 3006 | 3018 |
ϵ = 6 | 3024 | 3011 | 3027 | 3019 |
ϵ = 8 | 3040 | 3020 | 3025 | 3023 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Cells contain the number of screeners needed to obtain a 1,000-person adult AIAN sample. Enumeration districts are sampled with probability proportional to the size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate. We assume that all AIAN individuals agree to participate. Results for both the true data and the DP data are restricted to enumeration districts of at least 1,000 population.
All in all, the effects of DP noise injection on the efficiency of screening are relatively modest. While the projected cost increase for ϵ = 0.25 is large, the noise from DP does not have a large effect on screening costs for values of ϵ above 1. We now show that this is a product of a lack of coverage in the sampling frame for small values of ϵ. In particular, areas with adult AIAN individuals are not represented in the survey sampling frame because the noise from DP makes the area appear to have 0 adult AIAN population. To assess this lack of coverage, Table 7 presents the fraction of the adult AIAN population that is covered by the sampling frame for different values of ϵ and runs of the DP algorithm. With small values of ϵ, there is significant undercoverage of the survey frame, and roughly 30% of adult AIAN individuals are not represented. Thus, there is potential for important bias in estimators of population parameters of interest. Given appropriate control totals, calibration estimators may be effective in limiting the bias due to undercoverage.
Data | Run 1 | Run 2 | Run 3 | Run 4 |
True Data | 100% | | | |
Noise-Injected Data | | | | |
ϵ = 0.25 | 70.64% | 70.78% | 68.85% | 70.52% |
ϵ = 0.50 | 81.92% | 81.54% | 81.71% | 81.50% |
ϵ = 0.75 | 86.57% | 86.67% | 86.48% | 86.63% |
ϵ = 1 | 88.90% | 88.96% | 88.45% | 88.60% |
ϵ = 2 | 93.11% | 92.91% | 92.88% | 93.03% |
ϵ = 4 | 95.83% | 95.80% | 95.76% | 95.84% |
ϵ = 6 | 97.08% | 97.05% | 96.78% | 97.07% |
ϵ = 8 | 97.72% | 97.71% | 97.80% | 97.64% |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Figures in cells refer to the percent of the adult AIAN population in the true data that resides in an enumeration district with at least one AIAN adult in the noise-injected data. Enumeration districts are sampled with probability proportional to the size of the adult AIAN population, with enumeration districts that are at least 30% AIAN adults sampled at five times the rate.
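The coverage measure reported in Table 7 can be sketched as follows: it is the share of the true adult AIAN population living in districts that still show a positive AIAN adult count after noise injection. This is a minimal illustration that assumes each enumeration district is represented by a pair of true and noise-injected counts; the function name and the toy counts are hypothetical.

```python
def frame_coverage(true_counts, dp_counts):
    """Share of true AIAN adults in districts with a positive DP count."""
    if len(true_counts) != len(dp_counts):
        raise ValueError("inputs must have one entry per district")
    total = sum(true_counts)
    if total == 0:
        raise ValueError("no AIAN adults in the true data")
    covered = sum(t for t, d in zip(true_counts, dp_counts) if d > 0)
    return covered / total


if __name__ == "__main__":
    true_counts = [120, 3, 1, 0, 6]
    dp_counts = [118, 0, 2, 1, 0]  # the second and fifth districts drop out
    print(f"{100 * frame_coverage(true_counts, dp_counts):.2f}%")
```

Because the measure is weighted by true population, losing many districts with only a handful of AIAN adults each can still remove a sizable share of the population from the frame.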
This undercoverage becomes smaller as ϵ increases, but even modest values such as ϵ = 1 have over 10% undercoverage. Nonetheless, when ϵ = 8, the undercoverage is only a couple of percentage points. Given that the sampling process is based on averages across areas, it is likely that the undercoverage issues we are seeing are largely due to the noise injection per se, and not additional pieces of the DP algorithm such as postprocessing or invariants. While the inclusion of these additional features is likely to have some effect on the types of areas with more accurate data, they are likely a second-order issue in the context of the tract-level data being used to drive sampling decisions.
Note that in theory another solution to undercoverage would be to add a separate stratum to the sampling design for enumeration districts with zero reported adult AIAN population. This method would potentially eliminate bias yet result in an increase, possibly a very large increase, in the cost of the survey, which may be practically infeasible. Given fixed cost, the method would increase the variance and diminish the precision of survey estimators. Still another solution may involve a dual-frame sampling design in which the second sampling frame is a list rich in AIAN adults obtained through administrative sources (records).
The federal government distributed about $721 billion (about 16% of its budget) to states and localities in fiscal year 2019, providing about one-quarter of these governments’ total revenues. About 61% of those funds were dedicated to health care, 16% to income security programs, and 9% each to transportation and education, training, employment, and social services (Tax Policy Center, 2020). The distribution was based, in part, on decennial census population counts and related data obtained from the American Community Survey and other Census Bureau programs. Further, the distribution relied on data collected in various large government surveys, whose sampling designs relied on measures of size derived from the decennial census. Given the amounts of money involved, the importance of accurate census data is undeniable. Thus, it is important to reach an understanding of how DP noise injection might distort such fund allocations in future years.
Toward this end, we consider a hypothetical program that allocates funding to areas on the basis of the number of children under 18 years residing in the area (excluding group quarters). We take a full budget of $5 billion to be allocated across areas in proportion to the number of children in an area, which corresponds roughly to $125 a child. We then calculate the allocated per-child funding in both the DP and true data. We do this separately for both enumeration districts and counties. This analysis relates to the prior literature on the effects of error in population estimates on funding allocations, but is more specific in that it considers the effects of DP and uses 1940 data as a test case (National Research Council, 2003; Spencer, 1980; Spencer et al., 2017; Zaslavsky & Schirm, 2002).
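The per-child misallocation used in the tables that follow can be sketched as below. This is a minimal illustration under stated assumptions: each area's funding is the budget times its share of children in the data set being used, and the funding gap is divided by the area's true child count. The function name and the toy numbers are hypothetical, and areas with no children in the true data are skipped because the per-child gap is undefined there.

```python
def per_child_misallocation(true_children, dp_children, budget=5e9):
    """Per-child funding gap (DP minus true) for each area, in dollars.

    Returns None for areas with no children in the true data.
    """
    true_total = sum(true_children)
    dp_total = sum(dp_children)
    if true_total <= 0 or dp_total <= 0:
        raise ValueError("child totals must be positive")
    gaps = []
    for t, d in zip(true_children, dp_children):
        if t == 0:
            gaps.append(None)
            continue
        true_funding = budget * t / true_total  # allocation from true counts
        dp_funding = budget * d / dp_total      # allocation from DP counts
        gaps.append((dp_funding - true_funding) / t)
    return gaps


if __name__ == "__main__":
    # Toy example scaled so that funding is $125 per child in the true data.
    print(per_child_misallocation([4, 4, 2], [5, 5, 0], budget=1250))
```

In the toy example, the area whose reported child count falls to zero loses the full $125 per child, while the other two areas each gain $31.25 per child.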
Table 8a shows the distribution, across counties, of misallocated dollars per child. When restricting our attention to counties, the misallocation is relatively modest. Even for ϵ = 0.25, the 10th and 90th percentiles are roughly $3 per child in either direction. While this is modest as a percentage of total funding, it may still be large enough to cause concerns for districts that depend on the funds.
Percentile Misallocation
ϵ | 10 | 25 | 50 | 75 | 90 | Standard Deviation |
ϵ = 0.25 | -3.29 | -1.76 | -0.56 | 0.85 | 3.09 | 19.47 |
ϵ = 0.50 | -2.19 | -1.35 | -0.61 | 0.32 | 1.70 | 12.45 |
ϵ = 0.75 | -1.91 | -1.24 | -0.69 | 0.08 | 1.26 | 3.73 |
ϵ = 1 | -1.71 | -1.19 | -0.71 | -0.03 | 1.08 | 7.29 |
ϵ = 2 | -1.44 | -1.13 | -0.79 | -0.22 | 0.78 | 13.87 |
ϵ = 4 | | -1.09 | -0.82 | -0.33 | 0.61 | 3.81 |
ϵ = 6 | -1.29 | -1.10 | -0.85 | -0.36 | 0.58 | 11.65 |
ϵ = 8 | -1.29 | -1.10 | -0.86 | -0.38 | 0.62 | 2.27 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per child misallocation (funding using DP data–funding using true data), where percentile misallocation refers to the percentile of the empirical distribution of per child misallocation. Overall per child funding ≈$125.
Percentile Misallocation
ϵ | 10 | 25 | 50 | 75 | 90 | Standard Deviation |
ϵ = 0.25 | -57.19 | -21.15 | -1.91 | 17.91 | 55.83 | 577.48 |
ϵ = 0.50 | -28.97 | -10.91 | -1.02 | 9.74 | 30.79 | 484.22 |
ϵ = 0.75 | -19.76 | -7.53 | -0.77 | 6.68 | 21.30 | 432.59 |
ϵ = 1 | -15.12 | -5.90 | -0.73 | 5.16 | 16.95 | 417.06 |
ϵ = 2 | -8.18 | -3.43 | -0.69 | 2.67 | 9.16 | 408.98 |
ϵ = 4 | -4.71 | -2.23 | -0.73 | 1.32 | 5.31 | 411.04 |
ϵ = 6 | -3.52 | -1.84 | -0.78 | 0.83 | 3.95 | 414.30 |
ϵ = 8 | -3.50 | -1.64 | -0.82 | 0.58 | 3.88 | 412.44 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per child misallocation (funding using DP data–funding using true data), where percentile misallocation refers to the percentile of the empirical distribution of per child misallocation. Overall per child funding ≈$125.
Table 8b presents the results for enumeration districts, which show that the misallocation could be quite extreme. In fact, for ϵ = 0.25, the 10th and 90th percentiles of per-child misallocation are over $50 in absolute value, which is almost half of the original allocation. As another perspective, Figure 4 presents the full distribution of differences in allocation for various levels of ϵ. For low values of ϵ, there is a large mass at −$125, indicating enumeration districts that theoretically would lose all of their funding due to DP noise. As ϵ increases, the distribution tightens around 0 and the misallocation becomes less pronounced. Nonetheless, for many values of ϵ, the noise induced by DP leads to large misallocations of funds. Note that there is also a slight shift in the distribution of funds, as the median funding difference is negative. This result is an artifact of the postprocessing, which restricts population counts to be nonnegative while also holding state-level population invariant. While these effects are noteworthy, on the whole the primary patterns in the results are driven by the noise injection procedure itself and not by postprocessing.
Given that these results could be driven by enumeration districts with very small populations, Tables 9a and 9b provide the same results as Table 8b but restricted to enumeration districts of at least 100 and at least 1,000 individuals, respectively. While low values of ϵ still lead to pronounced misallocations, the results show that there is less misallocation when restricting to these larger enumeration districts. This highlights the fact that the DP procedure produces larger amounts of error for smaller areas precisely because it is these smaller areas that pose more privacy risk.
Percentile Misallocation
ϵ | 10 | 25 | 50 | 75 | 90 | Standard Deviation |
ϵ = 0.25 | -48.33 | -19.51 | -1.83 | 16.58 | 47.58 | 182.82 |
ϵ = 0.50 | -24.76 | -10.23 | -1.01 | 8.90 | 25.65 | 134.59 |
ϵ = 0.75 | -16.89 | -7.07 | -0.76 | 6.11 | 17.71 | 101.71 |
ϵ = 1 | -13.15 | -5.58 | -0.75 | 4.63 | 13.86 | 100.14 |
ϵ = 2 | -7.20 | -3.29 | -0.70 | 2.38 | 7.48 | 86.89 |
ϵ = 4 | -4.22 | -2.17 | -0.73 | 1.14 | 4.28 | 83.89 |
ϵ = 6 | -3.21 | -1.81 | -0.77 | 0.71 | 3.18 | 83.18 |
ϵ = 8 | -2.72 | -1.62 | -0.81 | 0.48 | 2.63 | 83.27 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per child misallocation (funding using DP data–funding using true data). Overall per child funding ≈$125.
Percentile Misallocation
ϵ | 10 | 25 | 50 | 75 | 90 | Standard Deviation |
ϵ = 0.25 | -25.61 | -12.47 | -1.65 | 9.67 | 24.51 | 84.50 |
ϵ = 0.50 | -13.55 | -6.75 | -1.01 | 5.14 | 13.04 | 63.19 |
ϵ = 0.75 | -9.43 | -4.75 | -0.76 | 3.55 | 9.22 | 50.99 |
ϵ = 1 | -7.41 | -3.84 | -0.77 | 2.65 | 7.07 | 61.86 |
ϵ = 2 | -4.23 | -2.38 | -0.69 | 1.36 | 4.07 | 31.97 |
ϵ = 4 | -2.71 | -1.69 | -0.69 | 0.64 | 2.54 | 15.25 |
ϵ = 6 | -2.20 | -1.51 | -0.72 | 0.41 | 2.14 | 11.04 |
ϵ = 8 | -1.93 | -1.29 | -0.75 | 0.29 | 1.93 | 11.21 |
Note. From 1940 Census demonstration differentially private (DP) data and IPUMS-USA 1940 full count data. Results are shown only for run 1 of the differential privacy algorithm; results from other runs are qualitatively similar. Figures in cells represent statistics for the per child misallocation (funding using DP data–funding using true data). Overall per child funding ≈$125.
While DP noise injection algorithms provide strong guarantees for protecting respondent privacy, they potentially diminish the utility of census data that have been collected at the cost of roughly $10 billion. We have analyzed the lost utility in terms of three separate use cases: 1) building an oversample of African Americans in a general population survey, 2) designing a survey with a screening instrument to interview only AIAN adults, and 3) allocating funding to local areas on the basis of number of children in an area. Because the effect of DP noise on data utility can vary widely depending on the particular use case and DP noise algorithm, it is important to analyze the effects of noise injection on data utility across a variety of scenarios.
Our results show that the effect of noise varies widely by use case and value of the privacy protection parameter, ϵ. When looking at statistics for relatively large groups (such as the use case of oversampling African Americans), the effect of DP noise injection is relatively modest. Nonetheless, uses of decennial census data that depend on accuracy for rare demographic groups or small geographic areas can be significantly affected by the additional noise. In particular, we find that for a survey of the adult AIAN population, unless ϵ is large, DP noise injection can lead to both a large loss of survey coverage and moderate cost increases to achieve a constant sample size. Moreover, our results indicate that small values of ϵ can also lead to substantial misallocations of funding for small geographic areas. As either ϵ or population size increases, these misallocations due to DP noise become more modest.
It is also worth noting that for the 2020 Census, the privacy budget, ϵ, will be spread across more tables of data and more geographies than was done for the 1940 data. Therefore, the values of ϵ considered in our analysis of 1940 data are unlikely to translate to similar losses of data utility in the 2020 Census. Whether this results in more or less data utility for a given level of privacy is not yet known. On one hand, because the privacy budget must be spread across many more tables in the 2020 Census, the amount of noise added to the data for a given ϵ may be much larger in 2020. In addition, the 1940 Census data we analyze here contains a limited set of variables and much different geographic definitions than will be released in 2020. On the other hand, future refinements in the algorithm could lead to greater efficiencies. Because ϵ is a measure of privacy and not of data utility, there may be additional methodological refinements that could significantly improve utility for data users. For example, additional analyses in the December 2019 Committee on National Statistics Workshop on 2020 Data Products have shown that more recent vintages of DP demonstration data lead to very few effects on survey efficiency, but can lead to decreases in coverage for small populations (Brummet, 2019). Nonetheless, as DP methodologies continue to evolve, it will be important to conduct additional analyses of the kind described here, both to measure the impacts of DP on data privacy and utility and to illuminate the appropriate balance between these competing ends.
We are grateful for support from the Alfred P. Sloan Foundation (award G-2019-12589).
Abowd, J., Ashmead, R., Garfinkel, S., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, A., & Sexton, W. (2019). Census topdown: Differentially private data, incremental schemas, and consistency with public knowledge. Working Paper. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0945_Consistency_for_Large_Scale_Differentially_Private_Histograms.pdf
Abowd, J. M., & Schmutte, I. A. (2019). An economic analysis of privacy protection and statistical accuracy as social choices. American Economic Review, 109(1), 171–202. https://doi.org/10.1257/aer.20170627
Ashmead, R., Kifer, D., Leclerc, P., Machanavajjhala, A., & Sexton, W. (2019). Effective privacy after adjusting for invariants with applications to the 2020 Census. Working Paper. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0941_Effective_Privacy_after_Adjusting_for_Constraints__With_applications_to_the_2020_Census.pdf
Barron, M., Davern, M., Montgomery, R., Tao, X., Wolter, K. M., Zeng, W., Dorell, C., & Black, C. (2015). Using auxiliary sample frame information for optimum sampling of rare populations. Journal of Official Statistics, 31(4), 545–557. https://doi.org/10.1515/jos-2015-0034
Biemer, P. P., & Lyberg, L. E. (2003). Introduction to survey quality (Vol. 335). John Wiley & Sons. https://doi.org/10.1002/0471458740
boyd, d. (2019). Differential privacy in the 2020 Census and the implications for available data products. arXiv. https://doi.org/10.48550/arXiv.1907.03639
Brummet, Q. (2019). The effect of differential privacy on survey operations. Paper presented at National Academies of Sciences, Engineering and Medicine, Committee on National Statistics Workshop on 2020 Census Data Products, Washington, DC, December 2019. https://www.nationalacademies.org/event/12-11-2019/workshop-on-2020-census-data-products-data-needs-and-privacy-considerations
Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley & Sons. https://www.wiley.com/en-us/Sampling+Techniques%2C+3rd+Edition-p-9780471162407
Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407. https://doi.org/10.1561/0400000042
Garfinkel, S. L., Abowd, J. M., & Powazek, S. (2018). Issues encountered deploying differential privacy. In Proceedings of the 2018 Workshop on Privacy in the Electronic Society (pp. 133–137). ACM. https://doi.org/10.1145/3267323.3268949
Kalton, G., & Anderson, D. W. (1986). Sampling rare populations. Journal of the Royal Statistical Society: Series A, 149(1), 65–82. https://doi.org/10.2307/2981886
Kalton, G. (2009). Methods for oversampling rare subpopulations in social surveys. Survey Methodology, 35(2), 125–141.
Kifer, D., & Machanavajjhala, A. (2014). Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems, 39(1), Article 3. https://doi.org/10.1145/2514689
Leclerc, P. (2019). Guide to the Census 2018 end-to-end test disclosure avoidance algorithm and implementation. Working Paper. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0938_2018_E2E_Test_Algorithm_Description.pdf
Li, C., Miklau, G., Hay, M., McGregor, A., & Rastogi, V. (2015). Optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6), 757–781. https://doi.org/10.1145/1807085.1807104
McClure, D., & Reiter, J. (2012). Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Transactions on Data Privacy, 5(3), 535–552. https://dl.acm.org/doi/10.5555/2423656.2423658
McKenna, L. (2018). Disclosure avoidance techniques used for the 1970 through 2010 Decennial Censuses of Population and Housing. Center for Economic Studies Working Paper 18–47, U.S. Census Bureau. https://www.census.gov/content/dam/Census/library/working-papers/2018/adrm/Disclosure%20Avoidance%20Techniques%20for%20the%201970-2010%20Censuses.pdf
National Research Council. (2003). Statistical issues in allocating funds by formula. National Academies Press. https://doi.org/10.17226/10580
Parsons, V. L., Moriarity, C., Jonas, K., Moore, T. F., Davis, K. E., & Tompkins, L. (2014). Design and estimation for the National Health Interview Survey, 2006–2015. Vital and Health Statistics Series, 2(165), 1–53. https://pubmed.ncbi.nlm.nih.gov/24775908/
Reiter, J. P. (2019). Differential privacy and federal data releases. Annual Review of Statistics and Its Application, 6, 85–101. https://doi.org/10.1146/annurev-statistics-030718-105142
Ruggles, S., Flood, S., Goeken, R., Grover, J., Meyer, E., Pacas, J., & Sobek, M. (2018). IPUMS USA: Version 8.0 Extract of 1940 Census for U.S. Census Bureau Disclosure Avoidance Research [data set]. IPUMS. https://doi.org/10.18128/D010.V8.0.EXT1940USCB
Ruggles, S., Fitch, C., Magnuson, D., & Schroeder, J. (2019). Differential privacy and census data: Implications for social and economic research. AEA Papers and Proceedings, 109, 403–408. https://doi.org/10.1257/pandp.20191107
Smith, T. W., Davern, M., Freese, J., & Morgan, S. L. (2019). General Social Surveys, 1972–2018: Cumulative codebook, Appendix A. NORC. National Data Program for the Social Sciences Series, no. 25. https://gss.norc.org/documents/codebook/gss_codebook.pdf
Spencer, B. (1980). Benefit-cost analysis of data used to allocate funds. Springer. https://doi.org/10.1007/978-1-4612-6099-8
Spencer, B. D., May, J., Kenyon, S., & Seeskin, Z. (2017). Cost-benefit analysis for a Quinquennial Census: The 2016 Population Census of South Africa. Journal of Official Statistics, 33(1), 1–26. https://doi.org/10.1515/jos-2017-0013
Tax Policy Center. (2020). Tax Policy Center briefing book: The state of state (and local) policy. https://www.taxpolicycenter.org/briefing-book/what-types-federal-grants-are-made-state-and-local-governments-and-how-do-they-work#:~:text=The%20federal%20government%20distributed%20about,of%20these%20governments'%20total%20revenues
Vadhan, S. (2017). The complexity of differential privacy. In Y. Lindell (Ed.), Tutorials on the foundations of cryptography (pp. 347–450). Springer. https://doi.org/10.1007/978-3-319-57048-8_7
Wolter, K. M., Polivka, A. E., & Lubich, A. (2015). Evolution of the Current Population Survey. In N. Balakrishnan, T. Colton, B. Everitt, W. Piegorsch, F. Ruggeri, & J. L. Teugels (Eds.), Wiley StatsRef: Statistics Reference Online. John Wiley & Sons. https://doi.org/10.1002/9781118445112.stat00063.pub2
Wood, A., Altman, M., Bembenek, A., Bun, M., Gaboardi, M., Honaker, J., Nissim, K., O’Brien, D. R., Steinke, T., & Vadhan, S. (2018). Differential privacy: A primer for a non-technical audience. Vanderbilt Journal of Entertainment & Technology Law, 21(17), 209–276. http://doi.org/10.2139/ssrn.3338027
Zaslavsky, A., & Schirm, A. (2002). Interactions between survey estimates and federal funding formulas. Journal of Official Statistics, 18, 371–391.
©2022 Quentin Brummet, Edward Mulrow, and Kirk Wolter. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.