This article explores the interface between disclosure protection and multiple dimensions of data quality (accuracy, accessibility, relevance, coherence, granularity, punctuality, and interpretability) in the dissemination of statistical information products and services. Operational constraints and related trade-offs receive special attention. The main ideas are placed in the context of previous literature, and summarized through several sets of questions.
Keywords: constrained optimization, decennial census, differential privacy, formal privacy, intangible capital, invariants, privacy budget, public goods, stakeholder communications, tiered access
In developing and implementing disclosure protection procedures for social and economic data, statistical organizations must address complex trade-offs among multiple features of the disclosure protection profile and multiple dimensions of data quality. Recent workshops and publications (including this special issue of Harvard Data Science Review) have expanded the horizon for the exploration of these trade-offs, especially in the areas of differential privacy and formal privacy protection.
For general background on these issues, see, for example, Abowd and Schmutte (2015, 2019), Abowd, Ashmead et al. (2019), Abowd, Schmutte et al. (2019), boyd (2019), Gong and Meng (2020), Hotz and Salvo (2020), Kearns and Roth (2019), Nissim and Wood (2018), the articles included in this special issue, and references cited therein. The materials in this special issue focus on differential privacy methods for the U.S. 2020 decennial census, which will be important in itself, and also will offer valuable insights for exploration of broader classes of potential applications.
This article provides some additional context for that exploration, with emphasis on the interface of disclosure protection with data quality and operational constraints. The exposition will use ‘disclosure protection’ as an umbrella term for a wide range of methods, for example, differential privacy (including the addition of noise and application of postprocessing constraints); other forms of noise infusion; rounding; swapping; cell suppression; and use of synthetic data. Section 2 summarizes the general mission of government statistical agencies, defines some terms used in the remainder of the article, reviews multiple types of data use for which disclosure protection is important, and outlines multiple dimensions of data quality considered important in assessment of the ‘fitness for use’ of statistical information products. Section 3 reviews legal, societal, and operational constraints that are important for practical work with disclosure protection and summarizes general approaches that statistical organizations can take in managing operations subject to such constraints. Within the context defined by Sections 2 and 3, Section 4 outlines the impact of disclosure protection on several aspects of accuracy. Similarly, Section 5 explores several ways in which disclosure protection intersects with concepts, policies, and practices for data access and also considers evaluation of stakeholder experiences with data accessibility and cost, with special emphasis on tiered access. Sections 6 through 10 consider the effects of disclosure protection and tiered access on, respectively, relevance, coherence, cross-sectional and temporal granularity, punctuality, and interpretability. Section 11 provides some closing remarks.
Sections 3 through 11 also include groups of questions that summarize some of the issues that warrant in-depth consideration in specific applications of disclosure protection methods. For some of these questions, we currently have a relatively strong base of methodological development, empirical information, and policy guidance to provide reasonably well-founded answers, at least for some practical cases; see, for example, the papers listed above, additional papers in the bibliography, and references cited therein. Some other questions are quite open, and we hope that the discussion in this special issue, and in subsequent research, will provide some progress toward sound answers. For persons especially interested in general statistical methodology, some of these questions may help in mapping connections between open problems in disclosure protection, and customary approaches to statistical methodology (e.g., through exploration of objective functions, constraints, conditioning, and the measurement and analysis of sources of uncertainty). For researchers focused on substantive analysis of data, some questions may provide traction for evaluating the impact of disclosure protection methods on the quality of their data, and of resulting analyses. For policy analysts, some questions may help to explore potential contributions of disclosure protection concepts and methods to clarification and resolution of competing interests that arise in work with tiered access and open data. Finally, for each of the above-mentioned groups, these questions may contribute to nuanced discussion of complex trade-offs required for realistic use of disclosure protection methods, and use of the resulting data.
Disclosure limitation problems—writ large—form an important subclass within the broader class of problems defined by the goals and operational environments of government statistical agencies. Those environments arise from the agencies’ dual mission:
They are expected to deliver high-quality statistical information products and services in the form of public goods to a range of stakeholders on a sustainable and cost-effective basis, within a complex set of societal and legal constraints (including privacy protection).
In addition, they are expected to adapt and enhance these statistical information products and services, and the production and delivery thereof, within a context defined by dynamic societal, technological, and data environments.
For work with disclosure protection, this dual mission leads to emphasis on (i) performance criteria and operating constraints, (ii) presentation and practical interpretation of empirical results related to these criteria, and (iii) identification of priorities for future methodological, empirical, and policy research.
The initial discussion will restrict attention to quality, risk, and cost factors that are important in the production of estimates for a pre-specified set of well-defined population parameters. This is a fairly standard approach to clarification of methodological controversies related to complex operations by statistical agencies; see, for example, Rubin (1996). Section 5 considers some related issues centered on microdata access, and on tiered access, rather than restricting attention to population aggregates as such.
To develop some notation, consider a group of estimands $\theta_j$, $j \in J$; common examples are population totals, means, ratios, or quantiles, or related analytic parameters like regression coefficients or parameters from a generalized linear model or a hierarchical model. In addition, $J$ is a predetermined index set that covers the group of estimands and estimators of interest for this discussion. For example, in the cases considered by Asquith et al. (2022) and Brummet et al. (2022), $J$ may be the index set for some or all states, counties, or enumeration districts in the United States, and $\theta_j$ may be a parameter based on person-level observations (e.g., a population total or proportion for a specified demographic group or a related index of dissimilarity), or $\theta_j$ may be based on more complex household-based data (e.g., the number of households with at least one person under the age of 18, and at least one parent thereof).
In addition, for each estimand $\theta_j$, consider a point estimator $\hat{\theta}_j$. Articles in this HDSR special issue focus primarily on estimators $\hat{\theta}_j$ that are computed from data subject to ‘added noise’ procedures within a differential privacy context. One could also consider estimators $\hat{\theta}_j$ that are computed by application of standard unadjusted procedures to disclosure-protected microdata. On the other hand, some authors (e.g., Fuller, 1993) have suggested direct adjustment of estimators to account for ‘noise’ that has been added as part of a disclosure-protection effort. See, for example, Karr (2017) and Gong (2019) for additional developments in this approach to disclosure protection. For general background on adjustment of estimators to account for measurement errors and other forms of ‘noise,’ see, for example, Fuller (1987), Carroll et al. (2006), and references cited therein. Also, there is potential interest in estimators that are computed from synthetic microdata, as discussed in, for example, Vilhuber et al. (2016) and references cited therein. Finally, Abowd and Schmutte (2015) include a robust discussion of disclosure-aware data analysis, and their appendix includes a general Bayesian formulation.
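As a toy illustration of the Fuller-style adjustment idea, the sketch below compares a naive variance estimate computed from noise-infused microdata with a moment-adjusted version. All numerical settings are hypothetical, and the sketch assumes (as adjustment methods generally require) that the variance of the infused noise is published.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unprotected microdata (illustrative, not from any agency file).
x = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Disclosure protection by noise infusion, with a publicly known noise variance.
sigma_u = 5.0
y = x + rng.normal(scale=sigma_u, size=x.size)

naive_var = y.var()                    # ignores the added noise; biased upward
adjusted_var = y.var() - sigma_u**2    # Fuller-style moment adjustment

print(round(x.var(), 1), round(naive_var, 1), round(adjusted_var, 1))
```

The adjusted estimate recovers the variance of the unprotected variable up to sampling error, while the naive estimate is inflated by roughly $\sigma_u^2$.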
Assessment of the practical impact of disclosure-limitation methods will generally depend on the context defined by the intended forms of data access. For example, much of the discussion in this special issue focuses on a pre-specified set of estimands $\theta_j$, $j \in J$, that are (sub)population totals, proportions, means, or quantiles. For work with a given data set, the resulting estimates are intended for public dissemination in the form of tables or related graphs or maps.
Other cases may focus on microdata (unit-level data) made available for researchers who intend to produce special tabulations or complex model-oriented analyses. Long-recognized microdata disclosure-protection issues have led statistical organizations to develop a range of techniques often described as ‘tiered access.’ For general background on tiered access, see, for example, Clark (2020), and references cited therein. For example, United Kingdom Data Service (2020) discusses a data access model with three tiers, labeled as ‘open’ (disseminated to the public without substantive restrictions), ‘safeguarded’ (subject to, e.g., licensing, data destruction provisions, authentication requirements, and other restrictions on types of users), and ‘controlled’ (with control managed under a ‘five safes’ rubric that includes: Safe Settings, Safe People, Safe Projects, Safe Output, and Safe Data). For additional background on the five safes approach, see Desai et al. (2016) and references cited therein.
Application of these concepts to specific cases generally involves balanced assessment of both the level of disclosure risk incurred through a given tier of access and the sensitivity of the data provided therein. See, for example, the “Confidentiality Classification Tool” in Statistics Canada (2020). In addition, see National Academies of Sciences, Engineering, and Medicine (NASEM, 2017b, figure 5-1) for alignment of multiple tiers of data protection with the degree of identifiability and the ‘level of expected harm’ to a respondent.
In addition, some aspects of the tiered access approach align with the use of synthetic data, as discussed in general by Vilhuber et al. (2016) and for specific cases, for example, the Survey of Income and Program Participation and the Longitudinal Business Database; and with the use of verification servers and validation servers, as discussed in, for example, Reiter et al. (2009) and Barrientos et al. (2018).
Finally, discussion of data access often emphasizes the use of data by academic researchers and policy analysts, but it is also important to consider appropriate forms of data access and use by state and local governments (e.g., for planning and assessment related to public health, education, workforce support, and economic development), commercial organizations, and students (e.g., for training in substantive empirical analysis areas, as well as general data analysis skills).
Data quality has received extensive attention in the statistical literature. For the current discussion, three general concepts are of special interest. First, data quality is inherently multi-dimensional, and practical decisions about the production and dissemination of statistical information will require complex trade-offs among various quality dimensions. Extending ideas in Brackstone (1999), NASEM (2017b, chap. 6), European Statistical System Committee (2019), and other work, the current article will place primary emphasis on seven dimensions of quality: accuracy, accessibility, relevance, coherence, granularity, punctuality, and interpretability. Sections 4 through 10 explore these dimensions in additional detail.
Second, the literature broadly acknowledges that for each dimension of data quality, practical evaluation depends heavily on context defined by the intended uses of those data and the anticipated users, often summarized by the term ‘fitness for use.’ Such an approach generally requires one to distinguish among subsets of the overall set of estimands $\theta_j$, $j \in J$, and the intended uses thereof, for example,
Descriptive statistics for large-scale aggregates, intended for the general public.
Analytic results from complex modeling of microdata, intended for specialized substantive research audiences, and based on rigorously pre-specified analytic plans. This might include formal study registration, as considered by, for example, Loder et al. (2010), Olken (2015), and Vilhuber (2020).
Analytic results from complex modeling of microdata, intended for specialized substantive research audiences, and based on highly exploratory work.
To some degree, these categories of intended uses align with the ‘tiered access’ concepts reviewed in Section 2.3.
Third, for any combination of the specified quality dimensions, in-depth analysis of fitness-for-use criteria depends heavily on specific features of a given application. Consequently, for evaluation of the interface between data quality and disclosure protection, there is special interest in the case studies considered in this special issue of HDSR, and other case studies presented during the December 11–12, 2019, Committee on National Statistics (CNSTAT) workshop on disclosure protection (Hotz & Salvo, 2020). The remainder of this article will refer briefly to some of those case studies, but a detailed discussion is beyond the scope of the current work.
In their efforts to provide the best balance of quality, risk, and cost in the production of statistical information, agencies must navigate carefully through an environment often dominated by legal, societal, and operational constraints. For example, constraints related to data quality may arise from formal legal or regulatory requirements and from the role of data quality in ensuring continued engagement in the process of data production and use by respondents, by users of published data products and related microdata access facilities, and by funding sources. In addition, disclosure protection policies and procedures involve a complex array of constraints related to multiple dimensions of quality, risk, and cost, as considered in the remainder of this section.
This navigation often encounters uncertainties regarding steps required for practical compliance with those constraints and related changes in those constraints that may arise from dynamics in the underlying legal, societal, economic, methodological, and technological environments. To clarify some of these issues for disclosure-protection work, this section outlines some issues related to the origins of dominant constraints, distinctions between ‘absolute’ and ‘somewhat flexible’ constraints, and optimization conditional on specified constraints.
Constraints that are important for disclosure-protection work in government statistical agencies can arise from several sources. Some are imposed by formal legal and regulatory requirements, for example, Title 13 of the United States Code, Title 26 of the United States Code, Confidential Information Protection and Statistical Efficiency Acts (CIPSEA) of 2002 and 2018, the Privacy Act of 1974, or the Information Quality Act, and related Statistical Policy Directives from the United States Office of Management and Budget.
Others constitute formal policies established through deliberation and decision processes by duly authorized groups within a specified statistical agency, consistent with the agency’s mandate under authorizing legislation, the above-mentioned legal and regulatory requirements, and general societal expectations that statistical organizations will exercise a ‘duty of care’ in protecting the confidentiality of the data that they receive. In U.S. statistical agencies, these policies often are informed by recommendations from formal advisory groups convened under the Federal Advisory Committee Act of 1972 (FACA) or by public responses to Federal Register Notices. For cases in which data are obtained through written agreements with other data providers (e.g., administrative records from federal, state, or local government agencies), the applicable policies may also be formalized in those written agreements. Still others are based on longstanding and carefully considered general practices of numerous statistical agencies; see NASEM (2017a) for related background.
Finally, some constraints emerge from program-specific operational considerations or program-specific legacy practices (e.g., release of specified tabular or microdata products in a given format and with quality characteristics similar to those for previous releases). For any of these cases, and especially for legacy cases, it is important to determine whether the impact of those constraints on the quality/risk/cost profile may have changed substantially due to changes in the underlying societal, technological, or data environments.
Absolute Constraints. For statistical agencies, some constraints are generally considered “hard” or absolute in nature. For example, the U.S. Census Bureau has a legal mandate to publish certain estimates from the decennial census by specified deadlines. Conversely, U.S. statistical agencies are strictly prohibited from releasing the Social Security number of a survey respondent. Also, statistical agencies in the United States and other countries are subject to other disclosure-protection legislation and regulation that may receive absolutist interpretations.
In addition, it is worth noting that some legislation or regulation related to a wide range of risks for the general public may be written in absolutist terms; but maintaining the prescribed absolute limits may become difficult or impossible in light of subsequent technological developments, or deeper exploration of unintended consequences. To take an example from public health, the Delaney Clause of the Food Additives Amendment of 1958 for the Federal Food, Drug and Cosmetic Act set a zero level of carcinogenic risk as a criterion for approval of food additives by the U.S. Food and Drug Administration. However, comprehensive enforcement of this clause subsequently became problematic due to improvements in the detectability of trace amounts of chemicals, and advances in epidemiological and clinical studies of carcinogenesis. See, for example, National Research Council (1982) and Merrill (1988, 1997) for general background on the Delaney Clause and related controversies.
Somewhat Flexible Constraints. Statistical agencies also encounter numerous constraints that require achievement of a stated goal ‘to the maximum extent possible.’ In such cases, it is incumbent upon the statistical agency to develop a reasonably rigorous and transparent characterization and quantification of that goal and to justify empirically (e.g., through diminishing-returns analyses) the selection of the bounds proposed to satisfy the ‘maximum extent possible’ criterion. In the field of disclosure protection, the establishment of privacy-loss budgets may be viewed along these lines.
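A diminishing-returns analysis of the kind suggested here can be made concrete with the simplest textbook mechanism. The sketch below is a standard Laplace-mechanism calculation (not tied to any agency's actual budget-setting process): for a sensitivity-1 count query, the noise scale is $1/\epsilon$, so the root-mean-squared error of the released count is $\sqrt{2}/\epsilon$, and each doubling of the privacy-loss budget buys progressively smaller absolute gains in accuracy.

```python
import numpy as np

# Laplace mechanism for a count query with sensitivity 1: noise scale = 1/epsilon,
# so RMSE of the released count is sqrt(2)/epsilon.
epsilons = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0])
rmse = np.sqrt(2.0) / epsilons

for eps, err in zip(epsilons, rmse):
    print(f"epsilon={eps:4.2f}  rmse={err:6.2f}")
```

Tabulations of this kind are one simple input to a transparent justification of a proposed privacy-loss budget, though realistic products involve many queries and postprocessing, so this closed form is only a lower-bound caricature.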
Adapting notation and concepts from Eltinge (2014) and Gonzalez and Eltinge (2016), some decision processes for statistical agencies can be viewed in a framework of constrained optimization. In particular, define a ‘performance profile’ vector $P$ that includes realistic measures of the dimensions of quality, risk, and cost considered for production of a set of estimates on a sustainable and cost-effective basis. In addition, consider a schematic model

$$P = f(X, Z; \gamma) + e, \qquad (\mathrm{P})$$

where $f$ is a function of known form, $X$ is a vector of design factors (nominally) controlled by a statistical agency, $Z$ is a vector representing environmental factors that are observable but not subject to control by the agency, $e$ is a remainder term representing the variability in $P$ not associated with changes in $X$ nor $Z$, and $\gamma$ is a vector of the parameters of the function $f$ and a related variance function for $e$. For the current discussion, prominent examples of $Z$ include societal perceptions related to trust in government statistical agencies and general views on privacy, confidentiality, and the sensitivity of specified types of data. Examples of $X$ include general decisions on selection of data sources and application of methodological and technological tools, general decisions on the use of differential privacy methods, and the choice of the numerical value of the privacy-loss budget term $\epsilon$.
With that background, consider a partition of the performance vector $P = (P_1', P_2')'$, where the subvector $P_1$ represents the performance characteristics that are subject to hard constraints, and the other subvector $P_2$ represents the performance characteristics for which we can consider trade-offs, conditional on satisfying the hard constraints in $P_1$. Examples of hard constraints in $P_1$ could include budgetary limits based on appropriations laws, statutory requirements to publish estimates for specified terms with specified deadlines, contractual requirements to use estimators that have standard errors below specified bounds, and legal or regulatory requirements related to privacy and confidentiality.
Within this framework, methodologists and managers in statistical agencies can be viewed as seeking to do the following:
(a) Explore the extent to which the model $(\mathrm{P})$ should be viewed only as qualitative and schematic, or as warranting some in-depth quantitative analysis. Analysts often use a quantitative approach in assessing the methodological properties of a given statistical procedure as reflected in the ‘accuracy’ dimension of data quality. However, each of the other dimensions of data quality considered in this article (accessibility, relevance, coherence, cross-sectional and temporal granularity, punctuality, and interpretability), and dimensions of risk and cost, can be viewed as including both components that are, respectively, potentially quantifiable (e.g., through a predictive model) or inherently qualitative. For cases in which the predominant dimensions of $P$ are inherently qualitative, decisions about the design factors $X$ may be especially challenging. For cases in which some notable dimensions of $P$ are quantitative, one may consider the following additional analyses.
(b) Understand important features of the function $f$ and the parameter $\gamma$, to the extent feasible. This may include efforts to collect sufficient empirical information to produce reliable estimators of $\gamma$; to assess the goodness-of-fit and predictive adequacy of model $(\mathrm{P})$ within specified zones defined by $X$ and $Z$; and to explore whether the proposed vector $Z$ may be omitting some environmental factors that are important predictors for some dimensions of $P$, but are not currently measured. This general approach includes, but is not limited to, analyses developed previously for variance, hazard, and cost functions applied to data from sample surveys and administrative records and receiver operating characteristic (ROC)–type curves used in assessment of trade-offs between privacy protection and data quality, for example, figure 1 of Abowd et al. (2019).
(c) Gauge current or prospective values of the environmental factors $Z$, including assessment of the distribution of $Z$ through either direct empirical evaluation, or through elicitation of priors from persons with expertise related to the factor $Z$. This assessment will be of special interest for cases in which $P$ is relatively sensitive to changes in $Z$.
(d) Evaluate the extent to which it is feasible and cost-effective for the statistical organization to change the settings of the design vector $X$. This may require assessment of the technical and managerial capacity to control specific dimensions of $X$; costs related to that control; and the probability distribution of ‘slippage errors’ that occur for the common cases in which control over a given dimension of $X$ is imperfect.
(e) Based on information from (b), (c), and (d), determine the preferred setting of the design vector $X$ that will provide the best possible balance of outcomes represented by $P_2$, conditional on satisfying the hard constraints identified in $P_1$.
(f) Ensure that, to the extent reasonably possible, the wide range of agency stakeholders contribute constructively to broad discussion and consensus-building regarding the information in (b), (c), and (d), and the decisions in (e). This requires clear communication by all participants regarding the information described in (a)–(e), and the limitations on that information. This also requires stakeholders to state their priorities as clearly as possible, and to recognize that decisions in this space generally require judgment calls that involve complex trade-offs among competing interests, in the presence of imperfect information.
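The constrained choice in step (e) can be caricatured in a few lines of code. In the sketch below, a single design factor (the privacy-loss budget $\epsilon$) is chosen to minimize an accuracy-loss component of $P_2$ subject to a hard privacy bound in $P_1$; the candidate values, the loss function, and the cap are all illustrative assumptions, not agency policy.

```python
import numpy as np

# Schematic version of step (e): pick one design factor x (here, epsilon) to
# optimize a soft criterion in P2 subject to a hard bound in P1.
candidates = np.array([0.25, 0.5, 1.0, 2.0, 4.0])  # candidate epsilon settings
privacy_cap = 2.0                                  # P1: hard privacy-loss bound

feasible = candidates[candidates <= privacy_cap]   # enforce the hard constraint
accuracy_loss = np.sqrt(2.0) / feasible            # P2: RMSE of a Laplace mechanism
best = feasible[np.argmin(accuracy_loss)]
print(best)
```

With a single design factor and a monotone loss, the optimum sits on the constraint boundary; realistic settings involve many design factors, multiple soft criteria, and uncertainty about the model $f$, so the grid search is only a schematic stand-in for a fuller decision analysis.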
Related Questions on Constraints and Realistic Trade-Offs Among Performance Criteria:
Q1: Which requirements are truly mandatory, and thus warrant inclusion in the constraint subvector $P_1$? In addition, what is the precise framing of each of these mandatory constraints? For example: (a) Are confidentiality constraints stated in zero-risk terms, or in terms of bounds on reidentification risk or attribute risk? (b) Do mandates to publish estimates for specified estimands $\theta_j$ include numerical bounds on the accuracy of the resulting estimator $\hat{\theta}_j$? (c) To what extent are open-government and open-data requirements satisfied through dissemination of data through channels like the Federal Statistical Research Data Center system, which requires disclosure-protection review of numerical results before placement into the public domain?
Q2: Among the performance criteria that are in the subvector $P_2$ (possibly including some criteria related to disclosure protection and data quality), what are the practical methods for measuring and reporting performance? In addition, do key stakeholders, and the general public, have realistic methods to gauge the impact of changes in $P_2$? What are the best ways to expand and communicate the set of realistic use cases that will help to assess said impact?
Q3: Based on the measures for $P_2$ considered in Q2, what are some practical methods (e.g., penalized objective functions) for exploration of trade-offs among the competing quality, risk, and cost components in $P_2$?
Q4: What are realistic expectations on the extent and ways in which statistical agencies and their key stakeholders will provide in-depth and transparent input that is responsive to the preceding questions? What are some feasible methods that can be used to focus the general exploratory discussion, and related specific negotiations about constraints, to account as much as possible for the complexities noted above? In addition, what are the norms for negotiating prospective changes in these constraints, and for related management of stakeholder expectations?
For any of the above-mentioned classes of estimators, the notion of ‘accuracy’ will center on the conditional distribution of the errors $\hat{\theta}_j - \theta_j$. Depending on specific substantive and methodological interests, the conditioning events of interest may include membership in a specified subgroup of $J$ (e.g., the group of smaller enumeration areas for a census), restriction to a single finite-population realization of a superpopulation model, specified underlying population conditions (e.g., restrictions on the density or severity of clustering of a minority subpopulation), or properties of certain error components (e.g., constraints on the correlation of missing-data propensities with the outcome variables). Subject to the specified degree of conditioning, analyses of ‘accuracy’ tend to focus primarily on the conditional expectation and variance of $\hat{\theta}_j - \theta_j$, but there also may be interest in other properties of the conditional distribution of $\hat{\theta}_j - \theta_j$, for example, upper- and lower-tail quantiles and related robust measures of dispersion like the interquartile range.
In addition, assessment of the ‘accuracy’ dimension of data quality often requires in-depth evaluation of the specific sources of uncertainty that may dominate the conditional mean and variance of $\hat{\theta}_j - \theta_j$. For sample surveys, the resulting ‘total survey error’ (TSE) literature (e.g., Anderson et al., 1979, Groves & Lyberg, 2010, and references cited therein) has tended to identify some sources (e.g., sampling variability, measurement error, and some incomplete-data effects) that can be quantified quite readily and has noted other sources (e.g., some forms of variable-specific and population-undercoverage effects) for which quantification may be more problematic, and may be limited to sensitivity analyses. In addition, several areas of recent work (e.g., Biemer, 2019, and the edited volume by Biemer et al., 2017) have explored extensions of the TSE literature to cases involving nonsurvey data sources; the resulting ‘components of total uncertainty’ approach has a direct bearing on estimators computed from population census data, or from the integration of multiple data sources. For any of these cases, if one or more data sources are subject to disclosure protection based on ‘added noise,’ then it is of interest to assess the relative contribution of ‘added noise’ to the overall ‘total uncertainty’ term. In addition, Karr (2017) considered TSE aspects of several disclosure protection approaches.
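When the disclosure-protection noise is added independently of the other error sources, its contribution to total uncertainty is additive in variance, which can be checked by simulation. The sketch below uses purely illustrative settings (a small-area count subject to Gaussian measurement error plus Laplace disclosure-protection noise) to verify that the total error variance is close to the sum of the two component variances.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo sketch of a 'components of total uncertainty' decomposition for a
# small-area count; all numerical settings are illustrative assumptions.
reps = 20_000
meas_sd = 5.0                 # measurement-error standard deviation (assumed)
eps = 0.1                     # privacy-loss budget (assumed)
lap_scale = 1.0 / eps         # Laplace scale for a sensitivity-1 count

meas = rng.normal(0.0, meas_sd, reps)       # measurement-error component
dp = rng.laplace(0.0, lap_scale, reps)      # disclosure-protection component
errors = meas + dp                          # total error of the released count

total_var = errors.var()
expected = meas_sd**2 + 2.0 * lap_scale**2  # Var of Laplace(b) is 2 * b^2
print(total_var, expected)                  # should agree up to Monte Carlo error
```

In this configuration the added-noise component ($2b^2 = 200$) dwarfs the measurement-error component ($25$); reversing the relative magnitudes, as would be typical for large published aggregates, requires only changing the assumed parameters.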
Related Questions for Work With Disclosure Limitation: Accuracy of Point Estimators
For discussion of the effect of added noise in disclosure protection work in a given setting, the abovementioned framework leads to the following questions on assessment of the ‘accuracy’ dimension of data quality.
Q5: What is the set of estimands and estimators, $(\theta_j, \hat{\theta}_j)$, $j \in J$, of principal interest? What are the primary substantive and methodological reasons for the emphasis on this set? To what extent do the proposed estimators include adjustments that use fully public information on disclosure protection uncertainty to account for the added noise?
Q6: In assessing the impact of disclosure protection uncertainty on the statistical properties of $\hat{\theta}_j$, what are the conditional-distribution properties (e.g., conditional bias, variance, or quantiles) of $\hat{\theta}_j - \theta_j$ of greatest practical interest to data users, and what are the particular conditioning events that are of greatest interest? Also, what are the primary substantive and methodological reasons for the emphasis on these specific conditional-distribution properties, and on the specified conditioning events?
Q7: To what extent can one produce an empirical decomposition of the ‘components of total uncertainty,’ identifying the predominant sources of bias and random variability of $\hat{\theta}_j$, under realistic approximations and related conditions?
Q8: Based on the decomposition referenced in Q7, what is the specific impact of the disclosure protection component of uncertainty, relative to other components like coverage error, reporting error, incomplete data patterns, and other sources of variability that occur commonly in the collection of data for production of official statistics?
Q9: What are practical methods to summarize and communicate the results from Q5 to Q8 across the full set $J$, including identification of, for example, overall patterns and the degree of heterogeneity of results across $J$, across different choices of the privacy-budget parameter $\epsilon$, or across different environmental or statistical-design conditions?
Inferential statements based on published estimates generally require related measures of uncertainty. For example, for univariate estimands $\theta_j$, construction of test statistics and confidence sets often will center on the nominal pivotal quantity $t_j = (\hat{\theta}_j - \theta_j)/\widehat{\mathrm{se}}(\hat{\theta}_j)$ and related reference distributions, where the standard error term $\widehat{\mathrm{se}}(\hat{\theta}_j)$ is the square root of the applicable variance estimator.
Most commonly, one uses functions of variance estimators that are computed through linearization or replication methods; see, for example, Wolter (2007). However, the properties of these variance estimators depend on the structure of the underlying components of uncertainty (as induced, for example, through sample design, incomplete-data patterns, disclosure protection, and measurement error). Consequently, in-depth study may be required to determine whether a given variance estimator accounts for all of the relevant sources of uncertainty (including the uncertainty from disclosure protection) in published decennial census estimates.
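To make the variance-accounting issue concrete, the following minimal Monte Carlo sketch contrasts the sampling variance of a survey mean with the total variance after disclosure-protection noise is added. All distributions, sample sizes, and the Laplace noise scale are illustrative assumptions, not any agency's actual mechanism:

```python
import random
import statistics

def protected_estimate(pop_mean, sample_sd, n, noise_scale, rng):
    """Simulate one published estimate: a sample mean plus Laplace noise.

    The Laplace noise (scale b) stands in for a generic disclosure
    protection mechanism with known, public variance 2*b**2.
    """
    sample = [rng.gauss(pop_mean, sample_sd) for _ in range(n)]
    # A difference of two Exponential(1/b) draws is Laplace(0, b).
    noise = rng.expovariate(1 / noise_scale) - rng.expovariate(1 / noise_scale)
    return statistics.fmean(sample) + noise

rng = random.Random(12345)
reps = [protected_estimate(100.0, 10.0, 25, noise_scale=3.0, rng=rng)
        for _ in range(20000)]

sampling_var = 10.0**2 / 25          # Var(sample mean) = sigma^2 / n = 4.0
noise_var = 2 * 3.0**2               # Var(Laplace(b)) = 2 b^2 = 18.0
total_var = sampling_var + noise_var # what a correct variance estimator must target

print(f"empirical variance of published estimates: {statistics.variance(reps):.2f}")
print(f"sampling variance alone (understates):     {sampling_var:.2f}")
print(f"sampling + disclosure-protection variance: {total_var:.2f}")
```

In this simple setting a variance estimator that targets only the sampling component understates total uncertainty by the publicly known noise variance 2b², and a corrected estimator could simply add that term back; more complex mechanisms and postprocessing steps generally require the kind of in-depth study described above.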
Also, for cases in which the cardinality of J is not small (e.g., for 50-plus states and related entities, dozens of counties within a state, or hundreds of enumeration districts), analysts often have interest in extensive exploratory analyses, for example, ranking counties from largest to smallest based on the values of θ̂_j, fitting of models intended for subsequent use in small domain estimation, performing related data-driven clustering of geographical or demographic areas, or drawing inferences regarding underlying causal phenomena. For some general methodological background on these issues, see, for example, Shen and Louis (1998), Rao and Molina (2015), and Imbens and Rubin (2015). It is broadly recognized that for cases involving extensive exploratory analyses, both analysts and users of the resulting statistical information should direct careful attention to the transparency, reproducibility, and replicability of the underlying procedures. For some related discussion, see, for example, NASEM (2017b, 2019) and Stodden et al. (2014). This same general idea applies to work with data subject to disclosure protections.
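One reason exploratory analyses such as ranking warrant this caution is that rankings of closely spaced estimates are fragile under disclosure-protection noise. The following sketch, with hypothetical county names, counts, and an illustrative Laplace scale, estimates how often added noise reorders a small set of published counts:

```python
import random

rng = random.Random(7)

# Hypothetical true county populations; the top three are closely spaced,
# so their ranks are fragile under added noise.
true_counts = {"Adams": 1510, "Baker": 1495, "Clark": 1480, "Dixon": 905}

def laplace(scale, rng):
    """Sample Laplace(0, scale) as a difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def rank_order(counts):
    """Return names ordered from largest to smallest count."""
    return [name for name, _ in sorted(counts.items(), key=lambda kv: -kv[1])]

# How often does disclosure-protection noise change the published ranking?
flips = 0
trials = 2000
for _ in range(trials):
    noisy = {k: v + laplace(20.0, rng) for k, v in true_counts.items()}
    if rank_order(noisy) != rank_order(true_counts):
        flips += 1

print(f"ranking changed in {flips}/{trials} noisy releases")
```

With gaps of 15 between the top three counts and a noise scale of 20, a large fraction of noisy releases reorder at least one pair, while the widely separated fourth county essentially never moves; analysts ranking small domains need uncertainty statements that reflect this.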
Related Questions for Work With Disclosure Protection: Quality of Inferential Procedures
Q10: In the computation of standard errors and related inferential statistics (e.g., confidence intervals), what methods are used to account for the effects of the predominant sources of ‘total uncertainty,’ including the effects of the uncertainty due to disclosure protection algorithms? Under specified conditions, what is known about the bias and stability properties of the associated variance estimators for θ̂_j, the distribution of t-statistics and related pivotal quantities, and the coverage rates and distribution of widths of related confidence intervals?
Q11: What is the effect of a given disclosure-protection procedure (e.g., formal privacy methods, use of synthetic data, or cell suppression) on the transparency, reproducibility, and replicability of statistical procedures and empirical results related to θ̂_j, including those that involve highly exploratory analyses?
As noted by Brummet et al. (2022), decennial population counts for relatively small groups (e.g., counties, tracts, block groups, or blocks, as well as 1940-era enumeration districts) are often used for the design of samples for subsequent surveys. In addition, such counts are also used for construction of survey weights and as input for small domain estimators. Some authors (e.g., Chang & Kott, 2008; Dever & Valliant, 2016; Kott, 2006) have previously studied adjustment of survey weights for coverage error, incomplete data, or other errors in population control totals, and it would be of interest to explore extensions of that work to the disclosure-protection uncertainty case considered here. Many survey designs combine data from the decennial census with tabulations from the American Community Survey (ACS), sometimes implicitly, by using population totals from the ACS that are benchmarked to decennial counts through the population estimates program.
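The sensitivity of weight construction to error in control totals can be illustrated with a single-cell poststratification sketch. The weights, outcomes, and control totals below are illustrative, and the simple ratio adjustment stands in for the more elaborate calibration methods cited above; the "protected" control carries disclosure-protection noise whose effect propagates directly into the weighted estimate:

```python
def poststratify(base_weights, control_total):
    """Scale weights so they sum to the control total (single-cell case)."""
    factor = control_total / sum(base_weights)
    return [w * factor for w in base_weights]

# Five sampled units, each with a base weight and a measured outcome.
base_weights = [100.0, 120.0, 80.0, 110.0, 90.0]   # sum to 500
outcomes = [1, 0, 1, 1, 0]                          # e.g., an employment indicator

true_control = 520.0       # unprotected census count for the cell
protected_control = 512.0  # same count after disclosure-protection noise

for label, control in [("true control", true_control),
                       ("protected control", protected_control)]:
    w = poststratify(base_weights, control)
    total = sum(wi * yi for wi, yi in zip(w, outcomes))
    print(f"{label}: estimated total = {total:.1f}")
```

Because the ratio adjustment multiplies every weight by the control, any noise in the control passes through multiplicatively to every weighted estimate for the cell, which is the propagation effect Q12 and Q13 ask about.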
Related Questions for Work With Disclosure Protection: Secondary Use of Data
Q12: What is the impact of disclosure protection uncertainty on the properties of customary statistical procedures that use the information in θ̂_j in intermediate steps of the design and implementation of subsequent surveys or other data-collection efforts?
Q13: What are effective methods to adjust standard procedures for sample design, construction of survey weights, small domain estimation, and related measures of uncertainty, to account for the effects of uncertainty arising from disclosure-protection procedures, within the framework defined by Q10 and Q11?
Brackstone (1999, p. 141) discussed the ‘accessibility’ dimension of quality with emphasis on “the ease with which it can be obtained from the NSO” (national statistical office) and with acknowledgment of the potential role of cost in determining accessibility. Similarly, discussion of quality by the European Statistical System Committee (2019) included: “Principle 15: Accessibility and Clarity. European Statistics are presented in a clear and understandable form, released in a suitable and convenient manner, available and accessible on an impartial basis with supporting metadata and guidance.” In addition, the literature generally acknowledges that for a given body of statistical information, there are degrees of accessibility, which depend on multiple features of the data in question, the prospective access process, and the data users.
Evaluation of the impact of differential privacy procedures, and of other disclosure protection methods with known statistical properties (e.g., synthetic data sampled from a public model), on the ‘accessibility’ dimension of data quality involves multiple layers of issues. This is due to the complex ecosystem of data dissemination and use by many intermediaries and data users, the intangible capital required for efficient work by each of these stakeholders, and the ways in which previous dissemination practices may affect perceptions of accessibility and the incremental costs of changes in data access and use procedures. Perceptions and substance of accessibility may also be influenced by expectations and practices related to the public goods features of statistical agency data releases and by evolving open-data and open-government policies. See NASEM (2017a) for some related discussion of customary procedures through which statistical agencies obtain public input regarding the specific sets of estimates θ̂_j to place in the public domain and the technical processes through which those agencies seek to ensure that the resulting data are readily discoverable and usable by a wide range of data users.
Related Questions for Work With Disclosure Protection: ‘Accessibility and Clarity’ Dimension of Data Quality
Q14: To what extent, and in what ways, do factors related to differential privacy and other disclosure-protection methodologies have an effect on agency decisions about the sets of estimates θ̂_j to place in the public domain? To what extent are those decisions, and related internal deliberation processes, affected by recent or anticipated changes in the underlying methodological, technological, and data environments?
Q15: To what extent, and in what ways, should discovery and search features of agency data dissemination websites incorporate information regarding the effect of disclosure-protection procedures on decisions referenced in Q14 (e.g., the effect of added noise on specific published data, or why the agency is no longer publishing data for certain cells), on the statistical features of the released data (e.g., changes in standard errors), and on related metadata?
Work with differential privacy and other disclosure-protection methods has been motivated in large part by legal and regulatory mandates to reduce or eliminate identity disclosure and attribute disclosure risks that otherwise might be incurred by a respondent to a survey or census, or by a person or entity that provides related data through administrative processes (cf. Abowd et al., 2019). In addition, statistical agencies generally are expected to give careful consideration to multiple dimensions of risk and cost incurred by data sources; by groups that are responsible for data capture, production of estimates θ̂_j, and curation of these estimates as well as the underlying microdata and metadata; by intermediaries that have important roles in further dissemination and curation of data that agencies place in the public domain; and by external analysts and other data users. Relative priorities among multiple dimensions of quality, risk, and cost require careful justification and related communication with oversight organizations, policymakers, and other stakeholders; Section 5.3 provides related comments. In addition, rigorous quantification of these risks and costs may not occur in some cases, and limitations on available risk and cost information may have important effects on stakeholder discussions, and acceptance, of disclosure-protection methods.
Related Questions for Work With Disclosure Protection: Risks and Costs
Q16: For each of the stakeholder groups described above (respondents and other data providers, statistical agencies, dissemination intermediaries, and external analysts and other data users), what are the predominant components of risk and cost that may be affected by differential privacy and other disclosure-protection procedures related to the prospective release of θ̂_j? To what extent can these costs (including fixed and variable components of both direct costs and avoided costs) and risks be qualitatively characterized or quantitatively measured, modeled, and managed? For risk and cost dimensions that reasonably can be quantified, what are the applicable methodologies for information capture and analysis, with emphasis on assessment of the impact of disclosure protection procedures? What are some practical ways in which that assessment can inform the implementation and management of specific disclosure-protection procedures? In addition, what are the best ways (e.g., simple case studies, graphical displays, and related narratives) to communicate the results of disclosure-protection analyses in ways that resonate with a wide range of stakeholders, including both technical specialists and more general data users?
Q17: For disclosure-protection cases in which some predominant risk and cost dimensions cannot reasonably be measured empirically, to what extent can some insights into quality-risk-cost trade-offs be obtained through Bayesian elicitation methods (e.g., Garthwaite, 2005; O’Hagan et al., 2006; and references cited therein), related sensitivity analyses, and studies of cognitive processes related to probability judgments by laypersons (e.g., Dougherty & Hunter, 2003a, 2003b)?
Q18: Per the final part of Q16, to what extent does the data user incur substantial added cost due to the use of specialized estimators that adjust for ‘added noise’ in disclosure-protected data?
Per the discussion of tiered access in Section 2.3, microdata (unit-level data) from statistical agencies are sometimes available to researchers for analyses that are not covered fully by the publicly released aggregate estimates θ̂_j. In some cases, this access occurs through collaboration with internal agency research personnel. In other cases, access is provided through restricted-access facilities; in the United States, one group of facilities is managed through the Federal Statistical Research Data Center (FSRDC) system, and similar arrangements exist in several other countries. Access through either internal agency collaboration or restricted-access facilities involves formal agreements that limit the data that can be released to the public, for example, estimates for subpopulation-level entries in tables, parameters of regression or other generalized linear models, related hierarchical models, and related standard errors. Historically, some statistical agencies also released public use microdata sets subject to certain restrictions like suppression of personally identifiable information and geographical labels, unit swapping or other perturbations, topcoding of some variables, and use of synthetic data.
It is arguable that increased access to microdata is desirable because it increases the use value or option value of the data for some stakeholders. For example, the use value may be increased for analysts who have interest in specialized estimands that are defined a priori, but are unlikely to be identified as priorities through customary agency stakeholder-outreach procedures, or for analysts who have a strong preference for use of specialized estimators (e.g., ones that are especially robust against certain problems with extreme values or data errors) that are not used in standard agency production methods. Similarly, the perceived option value may be increased when a preliminary exploratory analysis leads to follow-up questions that were not anticipated during initial proposals for work with the specified microdata set. For the latter case, some caution is warranted regarding the transparency, reproducibility, and replicability of the resulting statistical results (cf. the questions presented in Q11 for exploratory analysis of the pre-specified group of estimators θ̂_j, j ∈ J).
In addition, access to microdata may allow an analyst to carry out a range of model-evaluation procedures that would not be feasible if access were restricted to θ̂_j and related standard errors. In some cases, the analyst could then include in a public report an evaluation-result summary that would have de minimis impact on the privacy-loss budget or other measures of disclosure risk.
Work with microdata subject to disclosure protection procedures may lead to additional classes of issues related to quality, risk, and cost. For example, accuracy of results may be affected by numerous issues related to unit-level modeling methodology and the impact of disclosure protection on the properties of that methodology; and some aspects of the ‘accessibility and clarity’ of microdata work may depend on availability of unit-level metadata that may also require disclosure protection. In addition, statistical agencies will need to gauge carefully the disclosure risks associated with a given set of procedures for microdata use combined with specified protections, and individual researchers may be sensitized to risks arising from editorial processes (for example, perceptions of disclosure-protected data by some reviewers). Similarly, analyses of cost may require evaluation of the fixed and variable components of cost incurred by individual analysts, as well as by statistical agencies that capture the original microdata and by intermediaries who may be involved in the distribution of those data.
Finally, we note that the HDSR 2019 Symposium focused attention on analysis of microdata from the 1940 decennial census (see Sienkiewicz & Hawes, 2019, for general background). Under federal law, these 1940 data are now in the public domain. Although there are some structural differences, the released 1940 data have many features that are comparable to the anticipated 2020 microdata. Consequently, they provide an environment in which a wide range of researchers can apply currently available differential-privacy methodology and computer code and develop and test ideas for further refinements that will be of interest for future censuses and surveys.
Related Questions for Work With Disclosure Protection: Microdata Access and Use
Essentially all of the issues outlined in the preceding sections apply to microdata use, often with further complications. The following additional questions also warrant consideration for microdata cases.
Q19: What are the predominant quality, risk, and cost issues arising from analysis of a single microdata set within a protected environment, where that microdata set is subject to added noise for disclosure protection, or the products of microdata analysis are subject to agency disclosure-protection review before public release?
Q20: What are the additional ‘accuracy’ and disclosure-risk issues encountered with record linkage or entity resolution when one or more of the linked microdata sets are subject to added noise?
Q21: What are realistic ways in which to characterize, measure, model, and manage the incremental value (including both ‘use value’ and ‘option value’) conveyed to a given data user by access to microdata, relative to the value conveyed through access only to the estimates θ̂_j for a predetermined set J? To what extent, and in what ways, is that incremental value affected by specific disclosure-protection methods like those described in Q19? In what ways can that incremental-value analysis be aligned with realistic measures of accuracy, risk, and cost arising from a specific disclosure-protection procedure?
Q22: Within the broad classes of methodological questions for disclosure protection that could be addressed through analyses based on the 1940 decennial data, which ones have been addressed in depth by recent workshops and publications? Which ones have not been addressed in depth yet, and what are some realistic approaches to explore them within the environment defined by the 1940 microdata and currently available methodology and computer code? What additional classes of methodological questions for disclosure protection cannot be reasonably addressed in sufficient depth through work with the 1940 microdata? Are there other publicly available microdata sets that could be used for in-depth study of those additional classes of questions?
Any tiered-access system (as described in Section 2.3) will have an impact on data accessibility and may potentially affect other dimensions of data quality, especially through application of ‘safe data’ and ‘safe outputs’ criteria. This leads to the following questions.
Q23: Under what conditions does practical implementation of each of the ‘five safes’ criteria align with, respectively, ‘absolute’ or ‘somewhat flexible’ approaches to constraints as described in Section 3.2? What are objective criteria for satisfying any of the five safes criteria that must be considered absolute in nature? Similarly, for five safes criteria that are considered somewhat flexible, what are objective criteria (including sensitivity and likelihood of disclosure) for assessing trade-offs and are there hard thresholds beyond which a given ‘safe’ criterion cannot be allowed to slip?
Q24: What are realistic ways in which to characterize and measure the likelihood of disclosure risk incurred through the combined effects of a given tiered-access system and to align concepts of sensitivity with those properties? Also, in discussion of likelihood of disclosure, what are the relevant conditioning events to be considered in modeling of that likelihood and to what extent should attention focus on tractable bounds for that likelihood? To what extent can one use that alignment as part of the evaluation of trade-offs among multiple dimensions of data quality and disclosure protection? What are some mathematically tractable ways in which one may embed the methods of differential privacy and privacy-budget allocation into this broader evaluation?
Q25: What are realistic ways in which to adapt answers to Q23 and Q24 for tiered-access cases that involve use of synthetic data, as well as verification servers and validation servers?
Q26: What are realistic ways in which to characterize, measure, model, and manage operational risks (i.e., risks arising from imperfect implementation of a given procedure) arising from one or more components of a tiered-access system?
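Several of the questions above, particularly Q24's reference to privacy-budget allocation, presuppose some accounting mechanism for privacy loss. Under basic sequential composition for pure ε-differential privacy, spent budgets add, which a governance process might track with a simple ledger. The class and release names below are illustrative, and real systems would use more refined composition theorems:

```python
class PrivacyLedger:
    """Track privacy-loss spending under basic sequential composition,
    where the epsilons of approved releases simply add."""

    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = []

    def request(self, label, epsilon):
        """Approve a release only if it fits within the remaining budget."""
        if sum(e for _, e in self.spent) + epsilon > self.total_budget:
            return False
        self.spent.append((label, epsilon))
        return True

    def remaining(self):
        return self.total_budget - sum(e for _, e in self.spent)

ledger = PrivacyLedger(total_budget=1.0)
print(ledger.request("state tables", 0.5))    # approved
print(ledger.request("county tables", 0.3))   # approved
print(ledger.request("tract tables", 0.4))    # rejected: would exceed the cap
print(f"remaining budget: {ledger.remaining():.2f}")
```

Even this toy ledger makes visible the governance questions raised above: who sets the cap, who decides the allocation across releases, and what happens to requests that arrive after the budget is exhausted.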
As noted above, characterization of the ‘accessibility’ dimension of data quality is linked closely with questions of cost, including both directly incurred financial costs and related cognitive and time burdens. Navigation of these cost issues, and related governance of disclosure-protection and tiered-access systems, require management of at least four forms of intangible capital: data, analytic capabilities, internal systems, and stakeholder trust.
(i) Data assets often include multiple data sources provided by multiple stakeholders. In many cases, the data are provided by government agencies and held in public trust, with strong constraints related to confidentiality. In the United States, access policies for these data are currently subject to reconsideration due to the “presumption of access” requirement in the Foundations for Evidence-Based Policymaking Act of 2018 (2019). That requirement, and changing societal views regarding disclosure protection, arguably have changed the implicit pricing structure for these data.
(ii) Analytic capability, generally provided in whole or in part by external researchers, is essentially a form of human capital. Its deployment generally is subject to strong expectations regarding transparency of the methodology used and empirical results produced, but with varying practices in the application of those expectations, for example, ‘study registration’ for sets of analyses specified a priori vs. the norms on reporting results from highly exploratory data analyses. Changes in the confidentiality landscape (e.g., through caps on the privacy-loss budget or changes in the use of invariants) affect the allocation and use of that human capital, and the resulting productivity of published results.
(iii) Management systems include technology platforms for data search, discovery, access, analysis, and dissemination, technology and processes for disclosure protection, and the personnel who design, develop, and manage those systems. In some cases, efficiencies of scale may lead to dominance by a small number of such systems, or by a single ‘natural monopoly,’ but such cases lead to broad classes of questions related to governance and balancing of stakeholder interests.
(iv) Stakeholder trust and collaboration: appropriate levels of trust and collaboration among the stakeholders for (i)–(iii), as well as the general public, align closely with the ‘five safes’ approach reviewed in Section 2.3.
Different types of statistical information production and use (per the examples reviewed in Section 2)—and related disclosure-protection procedures—employ these forms of capital in different ways. For a given class of intended production and dissemination tasks characterized by the estimand set {θ_j, j ∈ J}, the optimal (or at least relatively efficient) use of these multiple forms of capital will vary, depending on the relative abundance and cost of those capital sources, the intended production function, its performance profile P, and predominant operating constraints.
Moreover, to varying degrees, each of these four forms of intangible capital is perceived to be a public good. For some general background on the public-goods literature, see, for example, Arrow and Fisher (1974) and Weisbrod (1964). Also, see Groshen (2018), Hess and Ostrom (2006), Hughes-Cromwick and Coronado (2019), Rolland (2017), Summers (2016), Teoh (1977), and Trivellato (2017) for discussion of statistical information products as public goods, and Abowd and Schmutte (2019) for discussion of public-goods aspects of confidentiality protection.
In addition, these forms of capital are generally controlled by multiple sets of stakeholders with different utility functions and operating norms. Each of these stakeholders may also have different views of the relative value of each of the four above-mentioned forms of capital, and of specific cases thereof.
Practical management of tiered-access systems also can be complicated by three additional factors. First, changes in underlying technologies, and in societal attitudes toward information and privacy, can lead to changes in the relative value of these four forms of capital, the value assigned to outputs from a tiered-access system, and related operating constraints. Within the context of this special issue of HDSR, notable examples of technological change are the database reconstruction theorem and development of the concepts and practices of formal privacy, differential privacy, and tiered access.
Second, to some degree, many tiered-access facilities have some characteristics of a natural monopoly, and thus will require a governance structure that balances the interests of multiple stakeholder groups. The constructive impact of the resulting governance process depends on whether it is aligned in a realistic way that mitigates the dominant structural problems induced by the (potential or actual) monopoly, has efficient mechanisms through which to resolve disagreements among multiple stakeholders, and is appropriately resourced.
Third, the cost-effective and sustainable management of a tiered-access facility must be integrated with other dimensions of the overall production and dissemination activities of a statistical agency. Examples of these dimensions include financial management (especially the capacity to fund both fixed and variable costs on a reliable and adaptable basis), human resource management (especially integration of the efforts of internal and external technical experts), fault-tolerant design of production and monitoring systems, and management of networks of data users and other external stakeholders.
Consequently, the data-quality criterion of ‘accessibility’ is an important part of a broader discussion of the mechanisms for, and costs of, access to all of the components of capital required to produce and disseminate a given body of statistical information. Furthermore, the access mechanisms and cost structures naturally are affected by a range of legal and regulatory constraints and related technologies, including those related to privacy protection, for example, differential privacy and other formal privacy methods.
Related Questions for Work With Disclosure Protection and Tiered Access to Data: Management of Multiple Forms of Intangible Capital
Q27: For a given body of data access and use within the tiered-access framework, what is a realistic production function that accounts for the four above-mentioned forms of capital (data assets, analytic capability, management systems, and stakeholder trust and collaboration), cash, and other important factors? To what extent, and in what ways, is that production function, and the resulting performance profile, affected by a given set of policies and procedures for disclosure protection and data governance?
Q28: For cases in which disclosure protection is managed by setting an overall privacy-loss budget and allocating that budget among subgroups, what are the ways in which the answers to Q27 vary with changes in the overall budget and its allocation?
Q29: To what extent, and in what ways, do the answers to Q27 and Q28 vary as one considers work with (a) a single survey or administrative record data set and (b) integration of multiple datasets originating with different providers?
Q30: In the development of a tiered-access governance system, what are the primary disclosure-protection components that warrant attention? Potential examples include choice of invariants in tabular publications; pre-specification of target estimands θ_j, j ∈ J, through ‘study registration’ in microdata analysis; ‘hard’ or ‘soft’ requirements regarding quality factors (especially accuracy, relevance, and granularity) for specified estimators and disclosure-protection procedures; setting of overall privacy-loss levels, and allocations thereof; and similar priorities in the use of other disclosure protection methods like swapping and cell suppression. Also, what are realistic governance processes that will allow a robust balance among the competing priorities identified by multiple stakeholders?
Q31: In the development and implementation of procedures to address the interests of multiple stakeholders, what are the strengths and limitations of ‘governance models’ developed previously for: (a) internal self-regulation by an operating unit, which is held responsible for outcomes through market forces and the overall legal system; (b) ‘government-owned-contractor-operated’ entities like national nuclear laboratories; or (c) regulated utilities? To what extent do risk-management components (e.g., market or tort-system responses under [a], reporting and approval requirements under [b], or environmental regulations under [c]) suggest potential governance components for disclosure protection under a tiered-access system? For work with disclosure protection in the currently changing environment, to what extent is it of special interest to examine historical experiences with (a)–(c) during periods of rapid change in the technology, stakeholder needs, and resource-allocation mechanisms? Similarly, to what extent do those historical cases offer insights into potential benefits and limitations related to horizontal integration of a data dissemination function across multiple programs and vertical integration of a data dissemination facility within an end-to-end program of data capture, processing, and dissemination? Also, what is the impact of specialized technology (e.g., a specific differential privacy and postprocessing system) on the comparative advantage of the horizontally or vertically integrated options and the intermediary roles that may arise in a tiered-access system? In addition, to what extent do the preceding answers depend on whether the architecture and code for a data dissemination system (including disclosure protection components) are managed on an open-source public-license basis?
Finally, some important governance processes related to disclosure protection will require nuanced measurement and interpretation of public perceptions related to privacy protection and trust in institutions. For general background on measurement of privacy perceptions and trust, see, for example, Bates et al. (2012), Bauer et al. (2019), Fobia et al. (2019), Hunter-Childs (2015), Hunter-Childs et al. (2015, 2019), Pew Research Center (2019), and references cited therein. For application of some of these measures to agency decisions on disclosure protection operations, see Hunter-Childs and Abowd (2019).
This leads to a final set of questions on governance.
Q32: When making decisions about specific features of a disclosure-protection procedure (e.g., setting a privacy-loss budget level, a swap rate, or a cell-suppression rate), what are realistic ways in which researchers and governance organizations can incorporate measures of public opinion? If numerical results from public opinion surveys are used in determining disclosure-protection features, what are some practical ways in which one can account for bias and variance properties of the opinion survey results? Also, what are some ways in which cognitive testing and other analytic methods can be used to fine-tune public opinion surveys on privacy and trust in ways that shed additional light on these governance issues?
For simple publications of tabular data, the ‘relevance’ criterion for data quality tends to focus on the alignment of specific measured variables, and underlying sampled populations, with the corresponding fundamental concepts (e.g., employment, income, education, or health status) and the intended reference population. For more complex analyses involving model development and estimation, the relevance criterion also depends on the extent to which the model aligns with the overall inferential goals of the analysis. For example, in a regression model intended to offer insight into the predictive relationship between an outcome variable Y and a predictor subvector X_1, after accounting for a second vector of demographic conditions X_2, all within a specified population U, the relevance criterion generally would depend on whether one can obtain and link the variables Y, X_1, and X_2 at the specified unit level. This leads to the questions:
Q33: To what extent do specific types of disclosure protection (and related stakeholder communications) lead data providers (e.g., survey respondents or managers of administrative systems) to be more willing to share data that closely address the relevance criteria described above? Also, in a provider-specific variant on Q32, what are realistic methods to assess the willingness of key stakeholders to provide requested data, and to identify policy, empirical, and stakeholder-communication features that may affect that willingness?
Conversely, some data users may express reluctance to use certain types of disclosure-protected data, due to a perception that certain disclosure protection methods reduce the relevance of such data. For example, unconstrained use of ‘noise’ procedures may lead to estimated counts that are negative, or to estimated vectors that do not satisfy additivity constraints (cf. Sexton, 2019). In addition, some analysts are reluctant to use synthetic microdata because they believe that the synthetic data-generating models are not sufficiently relevant to the intended class of analytic models.
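The negativity and additivity concerns noted above can be illustrated with a deliberately minimal sketch. The code below assumes Laplace noise calibrated to sensitivity-1 count queries with no postprocessing; the function names (`laplace_noise`, `noisy_counts`) are hypothetical and do not represent any agency's production mechanism:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_counts(true_counts, epsilon, seed=0):
    """Add Laplace(1/epsilon) noise to each sensitivity-1 count query."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in true_counts]

# Small cells under a tight budget: without postprocessing, released
# values may be negative, and the noisy cells need not sum to a
# separately noised total.
cells = noisy_counts([2, 0, 1, 5], epsilon=0.1, seed=42)
total = noisy_counts([8], epsilon=0.1, seed=7)[0]
print(cells)               # some entries may be negative
print(sum(cells) - total)  # additivity gap, generally nonzero
```

With the noise scale (10 here) large relative to the true counts, both pathologies arise routinely, which is precisely what motivates the postprocessing constraints discussed below.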
Q34: What are realistic methods to assess the perceptions of prospective data users regarding the relevance of disseminated data that have been subject to disclosure protection procedures? Also, to what extent do the resulting assessments offer insight into modifications of disclosure-protection methods, and related stakeholder communication, that would address the predominant ‘relevance’ concerns of key stakeholders?
Data users often seek estimates that are tabulated at relatively fine levels of cross-sectional aggregation (as defined by, e.g., geographic, demographic, or economic characteristics) and temporal aggregation. For some general background, see NASEM (2017b, pp. 116–117). The preference for highly granular data often is closely linked with notions of relevance, especially for cases in which a stakeholder decision (e.g., for a policy intervention or a marketing action) may be targeted within a short time period or a tightly defined cross-sectional group. In addition, efforts to analyze multivariate relationships based on excessively aggregated data can lead to variants on ‘ecological fallacy’ problems.
However, requests for highly granular data can require in-depth exploration of numerous issues with disclosure protection. For example, without the addition of noise or use of other disclosure protection methods, publication of tables with many cells can lead to problems arising from the database reconstruction theorem (Dinur & Nissim, 2003), per discussion by Abowd and Schmutte (2019) and others.
Conversely, the recent CNSTAT workshop on privacy protection for 2020 decennial census data products (https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518) explored the production of estimates at a fine level of cross-sectional granularity. In that discussion, added-noise procedures, in conjunction with postprocessing methods intended to preserve nonnegativity and additivity constraints (sometimes viewed as ‘natural constraints’), were observed in some cases to produce positive biases in areas with small counts, and negative biases in areas with large counts. For in-depth discussion of these issues, see, for example, Sexton (2019), Hotz and Salvo (2020), Abowd, Ashmead et al. (2019), and Ashmead et al. (2019), with special reference to important distinctions between the effects of, respectively, ‘added noise’ and postprocessing error, with the latter dominant in the above-mentioned bias issue. Technical details for reduction of these biases through the adjustment of the postprocessing algorithm are of considerable practical interest, but are beyond the scope of the current article. Broader questions about trade-offs between accuracy and granularity include the following.
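The positive bias in small-count areas can be illustrated with a deliberately simplified mechanism: Laplace noise followed by clipping at zero, a crude stand-in for nonnegativity postprocessing (the production TopDown algorithm is substantially more complex, and its large-area negative biases also involve additivity constraints not modeled here). The function names below are hypothetical:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def clipped_bias(true_count, scale, n_draws=50_000, seed=0):
    """Monte Carlo estimate of E[max(true + noise, 0)] - true."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += max(true_count + laplace_noise(scale, rng), 0.0)
    return total / n_draws - true_count

# Clipping at zero inflates small counts (noise that would have gone
# negative is pushed up), but barely affects large counts, where the
# nonnegativity constraint almost never binds.
print(clipped_bias(0, scale=10.0))     # clearly positive (near scale/2)
print(clipped_bias(1000, scale=10.0))  # near zero
```

For a true count of zero, the expected clipped value is scale/2, so the bias is half the noise scale; the simulation simply makes that sign pattern concrete.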
Q35: For a specified set of use cases for publication at a fine level of granularity, what are realistic ways in which to characterize and quantify the loss of information quality and stakeholder value that would be incurred through each of the following three options designed to satisfy a given privacy-budget constraint through use of added noise:
Tolerate estimator biases: Publication at the requested fine level of granularity, and with the specified additivity and nonnegativity constraints imposed through a given postprocessing algorithm?
Allow violation of some ‘natural’ constraints: Publication at the requested fine level of granularity, but with some or all of the additivity and nonnegativity constraints removed, and the postprocessing algorithm modified accordingly?
Coarsened publications: Publication only at a specific coarser level of granularity, and retaining all of the applicable previous additivity and nonnegativity constraints as implemented in the postprocessing algorithm?
In addition, to what extent do numerical results for the preceding three questions change as one changes the overall privacy-loss budget or the allocation of that budget? Finally, if the results from these questions indicate that none of the proposed options are fully satisfactory, to what extent can those concerns be addressed through use of validation-server methods or by conducting initial granular analyses within a research data center environment?
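One stylized way to quantify the accuracy side of the granularity trade-off in Q35 is simple variance accounting: with Laplace noise on sensitivity-1 counts and disjoint cells each receiving the full budget, a total assembled from k independently noised fine cells carries k times the variance of a single noised query at the coarse level. This back-of-envelope sketch (with hypothetical function names) ignores postprocessing and finer budget-allocation subtleties:

```python
def noisy_sum_variance(k_cells, epsilon):
    """Variance of a total formed by summing k independently noised cells."""
    scale = 1.0 / epsilon          # Laplace scale for sensitivity-1 counts
    per_cell_var = 2.0 * scale**2  # Var of Laplace(0, b) is 2 * b^2
    return k_cells * per_cell_var

def direct_query_variance(epsilon):
    """Variance of a single noised query issued at the coarse level."""
    scale = 1.0 / epsilon
    return 2.0 * scale**2

eps = 0.5
print(noisy_sum_variance(100, eps))  # 100 fine cells aggregated: 800.0
print(direct_query_variance(eps))    # one coarse query: 8.0
```

The hundredfold variance gap illustrates why, at a fixed budget, coarsened publication (Option 3) can deliver much more accurate totals, at the cost of the granularity that motivated the request in the first place.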
In data quality work, discussion of punctuality generally focuses on the elapsed time between the reference period of a specified data set and the public release date of a given data product. Long lags between a reference date and the completion of an analysis can substantially degrade the value of that analysis. This can be of special concern in the analysis of high-frequency economic data, as well as some short-duration public-health phenomena.
Also, for some cases of microdata access, researchers are required to submit proposals for specified classes of analyses of a given data set. Conditional on approval of the proposal, the researchers then have data access for a limited amount of time, and analysis results are subject to disclosure-protection review and approval prior to public release. Concerns about punctuality can arise at several steps in this process, for example: the lag between the reference date for the data set and its availability for researcher access; the elapsed time between submission and final approval of a research proposal; the time required to complete the planned analyses, especially for cases that require the researcher to travel a substantial distance to a ‘bricks and mortar’ analysis facility; and the time to complete disclosure-protection review of a specified set of analytic results. The latter step can be of special interest when a researcher seeks public release of a large number of supplementary tables or other analytic results. Also, for analyses intended for publication in refereed journals, researchers may be concerned that ‘revise and resubmit’ editorial decisions will require additional microdata work after expiration of the original period of data access.
These factors lead to an additional set of questions about the effect of disclosure protection work on punctuality.
Q36: Which features of a disclosure-protection process have a substantial impact on overall punctuality of a specified part of a tiered-access system? To what extent are specific concerns about punctuality focused on, respectively, the median time for completion of a specified step, or the upper tail of the elapsed-time distribution for that step? For the features with the largest impact on punctuality, what are the primary trade-offs between the magnitude of time lags and design features that are under the control of tiered-access management or individual researchers?
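The distinction in Q36 between median delay and upper-tail delay can be made concrete with standard summary statistics. The sketch below uses purely illustrative elapsed-time figures (the numbers are hypothetical, not drawn from any tiered-access system):

```python
import statistics

# Hypothetical elapsed times (in days) for disclosure-protection review
# of a set of research outputs; illustrative numbers only.
review_days = [5, 6, 7, 7, 8, 9, 10, 12, 15, 60]

median = statistics.median(review_days)
# quantiles(..., n=20)[18] gives the 95th percentile under the default
# 'exclusive' interpolation method.
p95 = statistics.quantiles(review_days, n=20)[18]

print(median)  # 8.5
print(p95)     # far larger than the median, driven by the 60-day outlier
```

A system whose median review time is acceptable can still impose serious punctuality costs on the minority of researchers who land in the upper tail, which is why Q36 asks about the two summaries separately.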
For evaluation of data quality, the ‘coherence’ criterion involves the expectation that reported results are conceptually and operationally unified on both an internal basis (i.e., within a given body of disseminated data, across multiple groups of estimands, and, if applicable, across multiple reference periods) and an external basis (i.e., across different studies in a given field). Due to natural variation in population conditions and statistical methods across populations and across time, perfect coherence may not be a realistic expectation. However, it is important for a statistical organization to make serious efforts to maintain a reasonable degree of internal and external coherence for its data products and to communicate clearly with stakeholders regarding related shortfalls. Also, stakeholder preferences for a high degree of coherence and comparability over time can impose substantial constraints on a statistical organization that is seeking to produce improvements in other dimensions of data quality (e.g., accuracy or relevance), cost management, or disclosure protection. In the context of disclosure protection, these considerations lead to an additional group of questions.
Q37: What are realistic ways in which to gauge whether adoption of a given disclosure-protection methodology will lead to substantial issues with coherence? In other words, if we have two candidate sets of estimators that have the same target estimands and are conceptually coherent and comparable, but use two distinct forms of disclosure protection, how do we assess the resulting limits on the practical coherence and comparability of the two sets of numerical results? For example, do changes in the use of invariants lead to patterns of cell publication (and nonpublication) that do not match previous publication patterns for the same production program or for peer production programs? Similarly, per the discussion in Section 7, do changes in added-noise and postprocessing algorithms lead to substantial changes in bias or mean-squared-error patterns for a given set of estimands? Also, for stakeholders who are especially interested in preservation of temporal coherence of published series, to what extent do changes in disclosure protection methodology lead to breaks in series; or to distortions in reported growth curves, seasonality patterns, or other trajectories of change?
In the discussion of the ‘interpretability’ dimension of data quality, Brackstone (1999) and others have emphasized the importance of metadata related to a specific data set or statistical production program. Such metadata can help stakeholders interpret specific results (e.g., by understanding variable definitions, data collection and analysis processes, and related limitations), and are also essential for numerous goals related to transparency, reproducibility, and replicability. For some general background on standards for statistical data and metadata, see, for example, United Nations Economic Commission for Europe (2020) and references cited therein.
Some preceding questions (e.g., Q15) touched on metadata topics. Within the context of tiered access and disclosure protection, questions on the ‘interpretability’ dimension of data quality also include the following.
Q38: To what extent does a given tiered-access system (or specific system component) provide sufficient metadata for stakeholders to understand: the general disclosure protection procedures used for a specific data release, specific tuning parameters (e.g., privacy-budget levels, public-data invariants, swap rates, or cell-suppression rules, where applicable), the properties (including both disclosure-protection properties and data-quality properties) of those procedures under customary operating conditions, and the protections offered by those procedures when the customary conditions are not satisfied? In addition, for the disclosure-protection procedure(s) used, have the system architecture and code base been placed in the public domain in a well-documented open source form that can be readily used and adapted by current and future researchers; and do the metadata include links to that information? Also, for tiered-access facilities that disseminate data from multiple programs, what are realistic expectations for the degree of standardization of metadata from those programs?
This article has outlined some concepts that may be useful as statistical agencies seek to explore and implement disclosure-protection methodology in additional depth. The questions presented in Sections 3–10 are intended to provide some traction as we seek to characterize, measure, model, and manage the space defined by multiple dimensions of quality, risk, cost, stakeholder value, and related constraints that are affected by decisions related to disclosure protection. We also hope that these questions may help to provide perspective in the exploration of methodological and policy issues that arise in specific case studies, for example, those considered in this special issue of the Harvard Data Science Review and in the December 11–12, 2019 Committee on National Statistics workshop on disclosure protection.
In that exploration, we would like to understand as much as possible about the features of that space that have a practical impact on our stakeholders for work in the current environment with, for example, differential privacy for the 2020 Census. We also seek to understand as much as possible regarding priorities for future work within a dynamic environment defined by changing social, economic, technological, methodological, and data-space factors. This leads to additional questions:
Q39: For areas related to Q1–Q38, where do we have solid empirical information to inform practical stakeholder discussion and agency decisions, either for special cases or for relatively general cases? Similarly, in which areas are the applicable methodology and technology relatively mature, and in which areas would additional investments potentially lead to substantial improvements in the technical base and the resulting quality, risk, and cost profiles? For areas with limited empirical information or underdeveloped technical features, what are the best ways to communicate those limitations to all stakeholders, and subsequently to improve the information base and technical performance? In addition, do the primary stakeholders follow standards for communication, so that communication is clear, appropriately tuned to the information base and decision needs of the applicable audience(s), inclusive of important nuances and cautionary notes, and free of sweeping statements that are not sufficiently supported by currently known facts?
Finally, Sections 2 and 5 noted that government agencies produce and disseminate statistical information in the form of public goods. However, some disclosure-protection methods (including differential-privacy procedures) are also used for statistical information produced and disseminated by private sector organizations. This leads to a last set of questions:
Q40: In what ways, and to what extent, are some of the results for the preceding questions Q1–Q39 functions of the public-goods mission, norms, and practices of government statistical agencies? When considered from a private-goods or club-goods perspective, what are some notable prospective changes in results for Q1–Q39?
The author thanks the HDSR 2019 Symposium authors for the opportunity to read their papers and presentations on disclosure protection, and thanks many colleagues in government statistical agencies, universities, and private-sector organizations in several countries for very productive discussions of the topics considered in this paper. In addition, the author thanks the Associate Editor and two referees for very thoughtful comments that led to improvement of this paper.
The views expressed here are those of the author and do not represent the policies of the United States Census Bureau.
Abowd, J. M., Ashmead, R., Garfinkel, S., Kiefer, D., Leclerc, P., Machanavajjhala, A., Moran, B., & Sexton, W. (2019). Census TopDown: Differentially private data, incremental schemas and consistency with public knowledge. Research and Methodology Directorate, United States Census Bureau. http://systems.cs.columbia.edu/private-systems-class/papers/Abowd2019Census.pdf
Abowd, J. M., & Schmutte, I. M. (2015). Economic analysis and statistical disclosure limitation. Brookings Papers on Economic Activity. https://www.brookings.edu/wp-content/uploads/2015/03/AbowdText.pdf
Abowd, J. M., & Schmutte, I. M. (2019). An economic analysis of privacy protection and statistical accuracy as social choices. American Economic Review, 109(1), 171–202. https://doi.org/10.1257/aer.20170627
Abowd, J. M., Schmutte, I. M., Sexton, W., & Vilhuber, L. (2019). Why the economics profession must actively participate in the privacy protection debate. American Economic Association: Papers and Proceedings, 109(May), 397–402, https://doi.org/10.1257/pandp.20191106
Andersen, R., Kasper, J., & Frankel, M. R. (1979). Total survey error. Jossey-Bass.
Arrow, K., & Fisher, A. C. (1974). Environmental preservation, uncertainty and irreversibility. The Quarterly Journal of Economics, 88(2), 312–319. https://doi.org/10.2307/1883074
Ashmead, R., Kiefer, D., LeClerc, P., Machanavajjhala, A., & Sexton, W. (2019). Effective privacy after adjusting for invariants with applications to the 2020 Census. GitHub. https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0941_Effective_Privacy_after_Adjusting_for_Constraints__With_applications_to_the_2020_Census.pdf
Asquith, B., Hershbein, B., Kugler, T., Reed, S., Ruggles, S., Schroeder, J., Yesiltepe, S., & Van Riper, D. (2022). Assessing the impact of differential privacy on measures of population and racial residential segregation. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.5cd8024e
Barrientos, A. F., Bolton, A., Balmat, T., Reiter, J. P., de Figueiredo, J. M., Machanavajjhala, A., Chen, Y., Kneifel, C., & DeLong, M. (2018). Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government. Annals of Applied Statistics, 12(2), 1124–1156. https://doi.org/10.1214/18-AOAS1194
Bates, N., Wroblewski, M. J., & Pascale, J. (2012). Public attitudes toward the use of administrative records in the U.S. Census: Does question frame matter? Proceedings of the 2012 FCSM Conference, Washington, DC, January 10–12. National Center for Education Statistics. https://nces.ed.gov/FCSM/pdf/Wroblewski_2012FCSM_III-A.pdf
Bauer, P. C., Keusch, F., & Kreuter, F. (2019). Trust and cooperative behavior: Evidence from the realm of data-sharing. PLOS One, 14(8), Article e0220115. https://doi.org/10.1371/journal.pone.0220115
Biemer, P. P. (2019). Total error frameworks for integrating probability and nonprobability data. Presentation to the ITACOSM, June 6. https://meetings3.sis-statistica.org/index.php/ITACOSM2019/ITACOSM2019/paper/view/1654
Biemer, P. P., de Leeuw, E., Eckman, S., Edwards, B., Kreuter, F., Lyberg, L. E., Tucker, N. C., & West, B. T. (Eds.) (2017). Total survey error in practice. Wiley. https://doi.org/10.1002/9781119041702
boyd, d. (2020, May 8). Balancing data utility and confidentiality in the 2020 U.S. Census. Data & Society. https://datasociety.net/library/balancing-data-utility-and-confidentiality-in-the-2020-us-census/
Brackstone, G. (1999). Managing data quality in a statistical agency. Survey Methodology, 25(2), 139–149. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/1999002/article/4877-eng.pdf?st=04h4icOd
Brummet, Q., Mulrow, E., & Wolter, K. (2022). The effect of differentially private noise injection on sampling efficiency and funding allocations: Evidence from the 1940 Census. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.a93d96fa
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781420010138
Chang, T., & Kott, P. S. (2008). Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika, 95(3), 557–571. https://doi.org/10.1093/biomet/asn022
Clark, C. Z. F. (2020). COPAFS-Hosted Tiered Access Workshops. Presentation to the Council of Professional Associations on Federal Statistics, March 6. Washington DC. https://copafs.org/wp-content/uploads/2020/03/CLARK-COPAFS-hosted-Tiered-Access-Workshops-rev.pdf
Desai, T., Ritchie, F., & Welpion, R. (2016). Five safes: Designing data access for research. Economics Working Paper 1601. University of the West of England. https://www2.uwe.ac.uk/faculties/BBS/Documents/1601.pdf
Dever, J. A., & Valliant, R. (2016). General regression estimation adjusted for undercoverage and estimated control totals. Journal of Survey Statistics and Methodology, 4(3), 289–318. https://doi.org/10.1093/jssam/smw001
Dinur, I., & Nissim, K. (2003). Revealing information while preserving privacy. In F. Neven & C. Beeri (Eds.), Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’03 (pp. 202–210). ACM. https://doi.org/10.1145/773153.773173
Dougherty, M. R. P., & Hunter, J. E. (2003a). Hypothesis generation, probability judgment and individual differences in working memory capacity. Acta Psychologica, 113(3), 263–282. https://doi.org/10.1016/S0001-6918(03)00033-7
Dougherty, M. R. P., & Hunter, J. E. (2003b). Probability judgment and subadditivity: The role of working memory capacity and constraining retrieval. Memory and Cognition, 31(6), 968–982. https://doi.org/10.3758/BF03196449
Eltinge, J. L. (2014). Discussion of, “A System for Managing the Quality of Official Statistics” by Biemer et al. Journal of Official Statistics, 30(3), 431–435. https://doi.org/10.2478/jos-2014-0024
European Statistical System Committee. (2019). Quality Assurance Framework, version 2.0. https://ec.europa.eu/eurostat/documents/64157/4392716/ESS-QAF-V1-2final.pdf/bbf5970c-1adf-46c8-afc3-58ce177a0646
Fobia, A. C., Holzberg, J., Eggleston, C., Hunter-Childs, J., Marlar, J., & Morales, G. (2019). Attitudes towards data linkage for evidence-based policymaking. Public Opinion Quarterly, 83(S1), 264–279. https://doi.org/10.1093/poq/nfz008
Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/115/plaws/publ435/PLAW-115publ435.pdf
Fuller, W. A. (1987). Measurement error models. Wiley.
Fuller, W. A. (1993). Masking procedures for microdata disclosure limitation. Journal of Official Statistics, 9(2), 383–406. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/masking-procedures-for-microdata-disclosure-limitation.pdf
Garthwaite, P. H., Kadane, J. B., & O'Hagan, A. (2005). Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100(470), 680–701. https://doi.org/10.1198/016214505000000105
Gong, R. (2019). Exact inference with approximate computation for differentially private data via perturbations. arXiv. https://doi.org/10.48550/arXiv.1909.12237
Gong, R., & Meng, X.-L. (2020). Congenial Differential Privacy under Mandated Disclosure. In J. Wing & D. Madigan (Eds.), FODS '20: Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference (pp. 59–70). ACM. https://doi.org/10.1145/3412815.3416892
Gonzalez, J. M., & Eltinge, J. (2016). Discussion of, “On a Modular Approach to the Design of Integrated Social Surveys” by Ioannidis et al. Journal of Official Statistics, 32(2), 295–300. https://doi.org/10.1515/jos-2016-0016
Groshen, E. L. (2018). Views on advanced economy price and wage-setting from a reformed central bank researcher and national statistician. Proceedings of the Conference on Price and Wage-Setting in Advanced Economies, ECB Forum on Central Banking (pp. 267–283). European Central Bank. https://www.ecb.europa.eu/pub/pdf/sintra/ecb.forumcentbank201810.en.pdf
Groves, R. M., & Lyberg, L. (2010). Total survey error: Past, present and future. Public Opinion Quarterly, 74(5), 849–879. https://doi.org/10.1093/poq/nfq065
Hess, C., & Ostrom, E. (Eds.). (2006). Understanding knowledge as a commons: from theory to practice. MIT Press.
Hotz, V. J., & Salvo, J. (2020). Assessing the use of differential privacy for the 2020 Census: Summary of what we learned from the CNSTAT Workshop. American Statistical Association. https://www.amstat.org/asa/files/pdfs/POL-CNSTAT_CensusDP_WorkshopLessonsLearnedSummary.pdf
Hughes-Cromwick, E., & Coronado, J. (2019). The value of U.S. government data to U.S. business decisions. Journal of Economic Perspectives, 33(1), 131–146. https://doi.org/10.1257/jep.33.1.131
Hunter-Childs, J. (2015). Public attitudes towards the use of administrative records to supplement the 2020 U.S. Census. Presentation to the Workshop on Alternative and Multiple Data Sources, Committee on National Statistics, National Academies of Science, Engineering and Medicine, Washington, DC, December 16. https://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_171489.pdf
Hunter-Childs, J., & Abowd, J. M. (2019). Update on confidentiality and disclosure avoidance. Presentation to the U.S. Census Bureau National Advisory Committee, November 8. https://www2.census.gov/cac/nac/meetings/2019-11/childs-abowd-update-on-confidentiality-disclosure-avoidance.pdf
Hunter-Childs, J., King, R., & Fobia, A. C. (2015). Confidence in U.S. federal statistical agencies. Survey Practice, 8(5). https://doi.org/10.29115/SP-2015-0024
Hunter-Childs, J., Fobia, A. C., King, R., & Morales, G. (2019). Trust and credibility in the U.S. Federal Statistical System. Survey Methods: Insights from the Field. https://surveyinsights.org/?p=10663
Imbens, G. W., & Rubin, D. B. (2015). Causal inference for statistics, social and biomedical sciences: An introduction. Cambridge University Press. https://doi.org/10.1017/CBO9781139025751
Karr, A. F. (2017). The role of statistical disclosure limitation in total survey error. In P. P. Biemer, E. de Leeuw, S. Eckman, B. Edwards, F. Kreuter, L. E. Lybert, N. Clyde Tucker, & B. T. West (Eds.), Total survey error in practice (chap. 4). Wiley. http://doi.org/10.1002/9781119041702.ch4
Kearns, M., & Roth, A. (2019). The ethical algorithm: The science of socially aware algorithm design. Oxford University Press. https://global.oup.com/academic/product/the-ethical-algorithm-9780190948207?cc=us&lang=en&
Kott, P. S. (2006). Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology, 32(2), 133–142. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2006002/article/9547-eng.pdf?st=na5CBksN
Loder, E., Groves, T., & MacAuley, D. (2010). Registration of observational studies: The next step towards research transparency. British Medical Journal, 340, 375–376. https://doi.org/10.1136/bmj.c950
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox and the 2016 U.S. presidential election. Annals of Applied Statistics, 12(2), 685–726. http://doi.org/10.1214/18-AOAS1161SF
Merrill, R. A. (1988). Implementation of the Delaney Clause: Repudiation of congressional choice or reasoned adaptation to scientific progress? Yale Journal on Regulation, 5(1), 1–88. https://digitalcommons.law.yale.edu/cgi/viewcontent.cgi?article=1062&context=yjreg
Merrill, R. A. (1997). Food safety regulation: Reforming the Delaney Clause. Annual Review of Public Health, 18, 313–340. https://doi.org/10.1146/annurev.publhealth.18.1.313
National Academies of Sciences, Engineering, and Medicine. (2017a). Principles and practices for a federal statistical agency (6th ed.). National Academies Press. https://doi.org/10.17226/24810
National Academies of Sciences, Engineering, and Medicine. (2017b). Federal statistics, multiple data sources, and privacy protection: Next steps. National Academies Press. https://doi.org/10.17226/24893
National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press. https://doi.org/10.17226/25303
National Research Council. (1982). Diet, nutrition and cancer. Report from the Committee on Diet, Nutrition and Cancer of the National Research Council. National Academies Press. https://www.ncbi.nlm.nih.gov/books/NBK216644/pdf/Bookshelf_NBK216644.pdf
Nissim, K., & Wood, A. (2018). Is privacy privacy? Philosophical Transactions of the Royal Society: Series A, 376(2128). https://doi.org/10.1098/rsta.2017.0358
O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P.H., Jenkinson, D. J., Oakley, J. E., & Rakow, T. (2006). Uncertain judgements: Eliciting experts' probabilities. Wiley. https://doi.org/10.1002/0470033312
Olken, B. A. (2015). Promises and perils of pre-analysis plans. Journal of Economic Perspectives, 29(3), 61–80. https://doi.org/10.1257/jep.29.3.61
Pew Research Center. (2019, November 15). Americans and privacy: Concerned, confused and feeling lack of control over their personal information. https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/
Rao, J. N. K., & Molina, I. (2015). Small area estimation (2nd ed.). Wiley.
Reiter J. P., Oganian A., & Karr, A. F. (2009). Verification servers: Enabling analysts to assess the quality of inferences from public use data. Computational Statistics and Data Analysis, 53(4), 1475–1482. https://doi.org/10.1016/j.csda.2008.10.006
Rolland, A. (2017). The concept and commodity of official statistics. Statistical Journal of the IAOS, 33(2), 373–385. https://doi.org/10.3233/SJI-160289
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Sexton, W. (2019). Day 2 follow-up. Presentation to the Committee on National Statistics Workshop, Washington, DC, December 12. https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518
Shen, W., & Louis, T. A. (1998). Triple-goal estimates in two-stage hierarchical models. Journal of the Royal Statistical Society: Series B, 60(2), 455–471. https://doi.org/10.1111/1467-9868.00135
Sienkiewicz, R., & Hawes, M. B. (2019). Background on differential privacy at the U.S. Census Bureau and 1940 census application. Paper presented at the Harvard Data Science Review Conference on Differential Privacy and the 2020 Census, October 25, Cambridge, MA.
Statistics Canada. (2020). Statistics Canada’s Confidentiality Classification Tool. https://open.canada.ca/ckan/en/dataset/2c910c37-c684-561e-9e0b-1d5bb6ca5fb9
Stodden, V., Leisch, F., & Peng, R. D. (2014). Implementing reproducible research. CRC Press.
Summers, L. (2016). The future of price statistics. http://larrysummers.com/2016/04/01/world-bank-price-stats/
Teoh, S. H. (1997). Information disclosure and voluntary contributions to public goods. RAND Journal of Economics, 28(3), 385–406. https://doi.org/10.2307/2556021
Trivellato, U. (2017). Microdata for social sciences and policy evaluation as a public good. IZA Discussion Papers, No. 11092. Institute of Labor Economics (IZA). https://www.econstor.eu/bitstream/10419/174002/1/dp11092.pdf
United Kingdom Data Service. (2020). Data access policy. https://www.ukdataservice.ac.uk/get-data/data-access-policy
United Nations Economic Commission for Europe. (2020). Standards and metadata. https://www.unece.org/stats/mos/stand.html
Vilhuber, L. (2020). Reproducibility and replicability in economics. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.4f6b9e67
Vilhuber, L., Abowd, J. M., & Reiter, J. P. (2016). Synthetic establishment microdata around the world. Statistical Journal of the IAOS, 32(1), 65–68. https://doi.org/10.3233/SJI-160964
Weisbrod, B. A. (1964). Collective-consumption services of individual-consumption goods. The Quarterly Journal of Economics, 78(3), 471–477. https://doi.org/10.2307/1879478
Wolter, K. M. (2007). Introduction to variance estimation (2nd ed.). Springer. https://doi.org/10.1007/978-0-387-35099-8
No rights reserved. This work was authored as part of the Contributor’s official duties as an Employee of the United States Government and is therefore a work of the United States Government. In accordance with 17 U.S.C. 105, no copyright protection is available for such works under U.S. law.