Data Flush

Data perturbation is a technique for generating synthetic data by adding “noise” to raw data, which has an array of applications in science and engineering, primarily in data security and privacy. One challenge for data perturbation is that it usually produces synthetic data that trade information loss for privacy protection. The information loss, in turn, reduces the accuracy of any statistical or machine learning method based on the synthetic data, weakening downstream analysis and degrading machine learning performance. In this article, we introduce and advocate a fundamental principle of data perturbation, which requires the preservation of the distribution of raw data. To achieve this, we propose a new scheme, named data flush, which ascertains the validity of the downstream analysis and maintains the predictive accuracy of a learning task. It perturbs data nonlinearly while accommodating the requirement of strict privacy protection, for instance, differential privacy. We highlight multiple facets of data flush through examples.


Introduction
Data perturbation gives rise to synthetic data by adding noise to raw data, which has had vast applications since the pioneering work of Breiman on estimating the prediction error in regression (Breiman, 1992). In the data privacy domain, data perturbation can ensure a prescribed level of privacy protection by imposing a suitable noise level (Day One Staff, 2018; Dwork, 2006; Erlingsson et al., 2014; Kaissis et al., 2020; Santos-Lozada et al., 2020; Venkatramanan et al., 2021). In statistics and data science, data perturbation is an effective tool for replicating a sample, for example, developing Monte Carlo methods of model selection (Breiman, 1992; Shen & Ye, 2002). In this situation, data perturbation generates synthetic data to resemble raw data in terms of distribution. Despite its great potential in many domain sciences, the data science community underappreciates the data perturbation technique.
In the differential privacy literature, data perturbation privatizes raw data to satisfy the requirement of ε-differential privacy (Dwork, 2006; Dwork, McSherry, et al., 2006), for example, by the Laplace method (Dwork, McSherry, et al., 2006; Dwork & Roth, 2014). Data perturbation can also mask sensitive classification rules in data mining (Delis et al., 2010). One major challenge for privacy protection is that most privatization methods lose information in the privatization process in order to satisfy a prescribed level of privacy protection (Gong & Meng, 2020; Goroff, 2015; Santos-Lozada et al., 2020). As a result, privatization weakens downstream statistical analysis and yields unreliable machine learning solutions. One remedy to information loss is to lower the level of protection in exchange for reasonably good accuracy of statistical analysis. This common practice is referred to as low-error-high-privacy differential privacy in the survey literature (Chen et al., 2016; Reiter, 2019).
In the statistics literature, data perturbation has been utilized for model assessment as in the generalized degrees of freedom (Ye, 1998), for developing adaptive model selection criteria (Shen & Huang, 2006; Shen & Ye, 2002) and model averaging criteria for nonlinear models (Shen et al., 2004), for estimating the generalization error (Shen & Wang, 2006), and for performing causal inference (Xue et al., 2021). One challenge here is how to generate synthetic data that validate statistical inference, despite the significant progress made for statistical prediction.
In many applied sciences, synthetic data must meet task-specific requirements for an end-user. In privacy protection, synthetic data or privatized data must meet some privacy protection standards to guard against disclosure. In statistics, synthetic data replicates a random sample so that users can perform statistical analysis, simulate phenomena and operational behaviors of a real-world process, and train machine learning algorithms. For instance, Candes et al. (2018) use knockoffs, a special kind of synthetic data, to estimate the Type I error or false discovery rate in feature selection. In such a situation, one challenge is how to ensure that synthetic data represent raw data while satisfying task-specific requirements to meet an end user's needs.
To meet the challenges, we first review the data perturbation technique and introduce a scheme of data perturbation, which we call data flush, to guide users in designing a perturbation process that validates the downstream analysis and yields reliable solutions. Then, we demonstrate the utility of data flush in two disparate yet intertwined areas: statistical inference and differential privacy. Critically, this scheme enables one to satisfy any level of privacy protection for differential privacy while maintaining the statistical accuracy of privatized data as if one used raw data. Finally, we show that the data-flush scheme can simultaneously satisfy requirements in both differential privacy and statistical inference.
The data-flush scheme is distinctive in three ways. First, it generates multiple perturbed copies of the raw data following a target distribution. Second, it can ensure differential privacy while preserving the target distribution. Third, it applies to nearly all kinds of data, particularly continuous, discrete, mixed, categorical, and multivariate. To the best of our knowledge, Bi and Shen (2021) and Woodcock and Benedetto (2009) are the only methods that preserve a target distribution, where the former satisfies differential privacy while the latter only limits disclosure risk. Furthermore, data flush also maintains its link with the raw data identifier or the user's identification, permitting data integration, data sharing, and personalization.
This article consists of five sections. Section 2 introduces the data-flush scheme and discusses its applicability in differential privacy and statistics. Section 3 develops a pivotal inference method based on data flush, which ascertains the validity of statistical inference. Section 4 applies the data-flush scheme to the 2019 American Community Survey Data to demonstrate its effectiveness in differential privacy protection and to contrast statistical inference before and after privatization. Section 5 discusses future directions of data perturbation. The Appendix contains some technical details.

Data flush
This section introduces a fundamental principle of data perturbation, stating that data perturbation must preserve the distribution of raw data to ascertain the validity of the downstream analysis and the reliability of a machine learning solution. Applying this principle, we derive a data perturbation scheme, called data flush, based on a family of nonlinear data perturbations, which simultaneously satisfy the requirements of differential privacy and valid statistical analysis.

Multivariate continuous distributions.
Given an independent sample $(Z_1, \ldots, Z_n)$ following a p-dimensional continuous distribution F, we apply (2.1) to each component of $Z_i = (Z_i^{(1)}, \ldots, Z_i^{(p)})$ through the probability chain rule, where $Z_{ij}^{(1)*}$, once given, yields $Z_{ij}^{(2)*}$ as in (2.1), and so forth. A perturbed sample is then given by (2.2), with the chain-rule steps running over $l = 2, \ldots, p$, as the first variable in the chain rule has preserved the identifier of raw data.
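To make the chain-rule construction concrete, here is a minimal sketch for a bivariate Gaussian. It assumes the univariate step takes the form $R^{-1}(G(U + e))$ reported in the regression example of Section 3, with $U$ the raw value on the uniform scale, $e$ Laplace noise, and $G$ the CDF of $U + e$. The choice of $U = F(Z)$, the conditioning used in the second step, the closed form of $G$, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm


def cdf_uniform_plus_laplace(x, b):
    """CDF G of U + e with U ~ Uniform[0, 1] and e ~ Laplace(0, b)."""
    def antiderivative(t):
        # Integral of the Laplace(0, b) CDF from -infinity to t.
        return np.maximum(t, 0.0) + 0.5 * b * np.exp(-np.abs(t) / b)
    return antiderivative(x) - antiderivative(x - 1.0)


def flush_uniform(u, b, rng):
    """Map uniform scores u to new Uniform[0, 1] scores via G(u + e)."""
    e = rng.laplace(0.0, b, size=np.shape(u))
    v = cdf_uniform_plus_laplace(u + e, b)
    return np.clip(v, 1e-12, 1.0 - 1e-12)  # keeps norm.ppf finite


rng = np.random.default_rng(0)
n, rho, b = 20_000, 0.6, 1.0  # b = 1/eps is the Laplace scale (hypothetical values)
s = np.sqrt(1.0 - rho**2)

# Hypothetical raw data: a bivariate Gaussian with correlation rho.
z1 = rng.standard_normal(n)
z2 = rho * z1 + s * rng.standard_normal(n)

# Step 1: perturb the first component through its marginal N(0, 1).
z1_star = norm.ppf(flush_uniform(norm.cdf(z1), b, rng))

# Step 2: put the second component on the uniform scale through its
# conditional distribution given z1, perturb, and map back through the
# conditional distribution given the perturbed first component.
u2 = norm.cdf((z2 - rho * z1) / s)
z2_star = rho * z1_star + s * norm.ppf(flush_uniform(u2, b, rng))

corr_star = np.corrcoef(z1_star, z2_star)[0, 1]
print(f"raw corr {np.corrcoef(z1, z2)[0, 1]:.3f}, perturbed corr {corr_star:.3f}")
```

Under these assumptions the perturbed pairs follow the same bivariate Gaussian as the raw pairs, so the perturbed correlation should stay close to `rho` whatever the noise scale.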
Discrete and mixed distributions.
A generalization of (2.2) to discrete or mixed distributions, including the empirical distribution, is achieved through a smooth version of noncontinuous F, which agrees with F at its jump values; see Bi and Shen (2021) for more details. Then, (2.2) applies by replacing F with its smooth version.
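One way to picture the smoothing step is sketched below for a categorical variable. The smooth version used here is a piecewise-linear CDF that matches F at the integer jump points, combined with uniform jittering. This particular construction, the probabilities, and the helper names are assumptions for illustration and need not coincide with the construction in Bi and Shen (2021).

```python
import numpy as np


def cdf_uniform_plus_laplace(x, b):
    """CDF G of U + e with U ~ Uniform[0, 1] and e ~ Laplace(0, b)."""
    def antiderivative(t):
        return np.maximum(t, 0.0) + 0.5 * b * np.exp(-np.abs(t) / b)
    return antiderivative(x) - antiderivative(x - 1.0)


rng = np.random.default_rng(1)
probs = np.array([0.5, 0.3, 0.2])  # hypothetical distribution on levels 1, 2, 3
levels = np.arange(1, 4)
z = rng.choice(levels, size=50_000, p=probs)

# Smooth version of F: linear interpolation through (k, F(k)), k = 0, ..., 3.
knots = np.arange(0, 4)
cdf_at_knots = np.concatenate([[0.0], np.cumsum(probs)])

# Jitter each level k into (k - 1, k] so that the smoothed CDF value is uniform.
z_cont = z - rng.uniform(size=z.shape)
u = np.interp(z_cont, knots, cdf_at_knots)

# Perturb on the uniform scale, then invert the smooth CDF and round up.
b = 1.0  # Laplace scale 1/eps (hypothetical value)
e = rng.laplace(0.0, b, size=u.shape)
v = np.clip(cdf_uniform_plus_laplace(u + e, b), 1e-12, 1.0 - 1e-12)
z_star = np.ceil(np.interp(v, cdf_at_knots, knots)).astype(int)

freq_star = np.bincount(z_star, minlength=4)[1:] / z_star.size
print("target probabilities:", probs)
print("perturbed frequencies:", np.round(freq_star, 3))
```

Because the smoothed CDF agrees with F at the jump points, rounding the continuous output up to the next level returns a variable with the original category probabilities.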

Key properties and benefits.
Several characteristics of data flush in (2.2) are worth mentioning. First, $Z_{ij}^{*}$ follows the target distribution R. This distribution-preservation property ensures statistically valid analysis on perturbed data. Second, $Z_{ij}^{*}$ is positively correlated with $Z_i^{(1)}$, as measured by … Moreover, … are conditionally independent given $U_i = (U_i^{(1)}, \ldots, U_i^{(p)})$; $i = 1, \ldots, n$, while … in probability; $i = 1, \ldots, n$, $j = 1, \ldots, m$.
The proof is given in the Appendix.
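The first two properties can be checked numerically in the univariate case. The sketch below assumes the form $Z^* = R^{-1}(G(F(Z) + e))$ with $F = R$ the standard normal CDF and Laplace(0, 1/ε) noise. The privacy factors are hypothetical, and Spearman's rank correlation is used as one possible measure of association.

```python
import numpy as np
from scipy.stats import kstest, norm, spearmanr


def cdf_uniform_plus_laplace(x, b):
    """CDF G of U + e with U ~ Uniform[0, 1] and e ~ Laplace(0, b)."""
    def antiderivative(t):
        return np.maximum(t, 0.0) + 0.5 * b * np.exp(-np.abs(t) / b)
    return antiderivative(x) - antiderivative(x - 1.0)


rng = np.random.default_rng(2)
z = rng.standard_normal(5000)  # hypothetical raw sample from N(0, 1)
u = norm.cdf(z)

results = []
for eps in (1.0, 5.0, 25.0):  # hypothetical privacy factors
    b = 1.0 / eps
    e = rng.laplace(0.0, b, size=u.shape)
    v = np.clip(cdf_uniform_plus_laplace(u + e, b), 1e-12, 1.0 - 1e-12)
    z_star = norm.ppf(v)
    ks_stat, _ = kstest(z_star, "norm")   # distance to the target N(0, 1)
    rank_corr, _ = spearmanr(z, z_star)   # association with the raw data
    results.append((eps, ks_stat, rank_corr))
    print(f"eps={eps:5.1f}  KS distance={ks_stat:.3f}  rank corr={rank_corr:.3f}")
```

In this sketch the Kolmogorov-Smirnov distance to the target stays small for every ε, while the rank correlation with the raw data grows as the noise shrinks.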

Applications.
2.3.1. Differential privacy.
This subsection reviews the application of data perturbation in differential privacy and presents the advantages of data flush. Differential privacy has become the gold standard of privacy protection for publicly released data, for example, census data (Kenny et al., 2021; United States Census Bureau, 2020). Given a prescribed level (i.e., privacy factor) ε > 0 of privacy protection, ε-differential privacy (Dwork, 2006) requires that the alteration of any original data leads to a small change of the released information.
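For reference, the requirement can be written in its standard form. The notation below (a randomized mechanism $\mathcal{M}$, neighboring data sets $D$ and $D'$ that differ in one record, and a measurable set $A$ of outputs) is ours and is not taken from the article.

```latex
\[
  P\bigl(\mathcal{M}(D) \in A\bigr) \;\le\; e^{\varepsilon}\,
  P\bigl(\mathcal{M}(D') \in A\bigr)
  \qquad \text{for all neighboring } D, D' \text{ and all measurable } A .
\]
```

A smaller ε forces the two output distributions closer together and hence gives stronger protection.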
The differential privacy literature focuses on the design of privatization methods satisfying ε-differential privacy. Towards this end, Wasserman and Zhou (2010) laid the statistical foundation of differential privacy. As noted in Goroff (2015), Santos-Lozada et al. (2020), and Gong and Meng (2020), essentially all privatization methods weaken downstream statistical analysis as the price of achieving a prescribed level of privacy protection, which is referred to as the trade-off between data privacy and usefulness. Moreover, differential privacy usually entails an impractical requirement on raw data, namely, the bounded support of its underlying data distribution (Wasserman & Zhou, 2010).
To alleviate the accuracy loss and the boundedness requirement, scientists attempt to approximately preserve some summary statistics of raw data in a privatization process. Snoke and Slavković (2018) suggested a privatization method that maximizes a distributional similarity between privatized and raw data. Liu, Vietri, Steinke, et al. (2021) (i.e., PMW) leveraged public data as prior knowledge to improve differentially private query release, and Liu, Vietri, and Wu (2021) (i.e., GEM) developed an iterative method to approximately preserve the answers to a large number of queries for discrete data. Boedihardjo et al. (2021) improved the statistical accuracy of the Laplacian method by estimating the distribution of raw data. However, none of these methods preserves the probability distribution of raw data, although they intend to retain some summary statistics such as the distributional similarity and answers to queries. Furthermore, GEM focuses on a weaker version of ε-differential privacy, known as (ε, δ)-differential privacy (Dwork, Kenthapadi, et al., 2006), where δ denotes the probability of information being leaked.
Despite the progress, information loss for downstream statistical analysis prevails for most privatization methods. Preservation of summary statistics may be inadequate because an evaluation metric for statistical analysis or a machine-learning task often requires knowledge of the data distribution. For example, GEM suffers from a loss of statistical accuracy even though it intends to preserve the discrete distribution of multi-way interactions. As illustrated in Table 1, GEM not only incurs a significant loss of accuracy in terms of predictive performance and parameter estimation in regression analysis but also requires excessive computation to achieve privatization. In contrast, the data-flush scheme (2.2) maintains high statistical accuracy due to distribution preservation, which makes the data more useful for downstream analysis. More simulation details are provided in the Appendix.
Data flush adds suitable noise to guarantee a prescribed level of privacy protection while applying a nonlinear transformation to preserve a target distribution, which validates the downstream analysis and provides reliable solutions. For example, one can adopt a version of (2.2) with noise $e_{ij}$ following a Laplace(0, 1/ε) distribution to guarantee ε-differential privacy (Bi & Shen, 2021), and a smoothed empirical CDF to approximate the original data distribution. However, the empirical CDF has to be built upon an independent sample to satisfy the definition of ε-differential privacy. Public data from similar studies can serve as the independent sample, such as past American Community Survey data for the current American Community Survey or Census. As an alternative, one can also consider a holdout sample, which is a random subset of the raw data (Bi & Shen, 2021). In this situation, the holdout sample is fixed once selected. Any alteration, query, or release of the holdout sample is not permissible. This guarantees the strict privacy protection of individuals in the holdout sample. In this sense, differential privacy does not apply to the holdout sample, since query and alteration as required by the definition of differential privacy are not allowed.
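The holdout-based variant described above can be sketched as follows. The code assumes the univariate form $Z^* = \hat F^{-1}(G(\hat F(Z) + e))$, with $\hat F$ a linearly interpolated empirical CDF built from the holdout sample only. The data, split sizes, and helper names are hypothetical, and the sketch illustrates the mechanics rather than certifying a privacy guarantee.

```python
import numpy as np
from scipy.stats import ks_2samp


def cdf_uniform_plus_laplace(x, b):
    """CDF G of U + e with U ~ Uniform[0, 1] and e ~ Laplace(0, b)."""
    def antiderivative(t):
        return np.maximum(t, 0.0) + 0.5 * b * np.exp(-np.abs(t) / b)
    return antiderivative(x) - antiderivative(x - 1.0)


rng = np.random.default_rng(3)
raw = rng.gamma(shape=2.0, scale=1.5, size=6000)  # hypothetical raw data
holdout, release = raw[:2000], raw[2000:]         # holdout is never released

# Smoothed empirical CDF from the holdout sample (linear interpolation).
xs = np.sort(holdout)
ps = np.arange(1, xs.size + 1) / (xs.size + 1)


def f_hat(x):
    return np.interp(x, xs, ps)


def f_hat_inv(p):
    return np.interp(p, ps, xs)


eps = 1.0                                         # hypothetical privacy factor
b = 1.0 / eps
u = f_hat(release)
e = rng.laplace(0.0, b, size=u.shape)             # Laplace(0, 1/eps) noise
v = cdf_uniform_plus_laplace(u + e, b)
privatized = f_hat_inv(v)

ks = ks_2samp(release, privatized).statistic
print(f"two-sample KS distance between raw and privatized: {ks:.3f}")
```

Only the release portion is perturbed; the holdout sample enters solely through the estimated CDF, in line with the requirement that it never be queried or released.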

2.3.2. Inference.
This subsection briefly comments on data flush as a tool for statistical inference. A crucial aspect of data flush is its capability of recovering the exact distribution of a pivotal quantity in the finite sample regime, as shown in Theorem 1. In contrast, a resampling method such as bootstrap (Efron, 1992; R. J. Tibshirani & Efron, 1993) approximates the distribution of a pivotal via a Monte Carlo method, which cannot recover the exact distribution in the finite sample regime. Moreover, data flush has great potential to treat the issue of bias in inference after model selection, as demonstrated in Section 3. In contrast, standard bootstrap suffers from the difficulty of discontinuities of an estimate (Efron, 2004).

Other applications.
Data flush has applications in other areas.

Model sensitivity.
To quantify the impact of model selection on estimation, Ye (1998), Shen and Ye (2002), and Shen and Wang (2006) define the generalized degrees of freedom using the notion of model sensitivity through a linear perturbation form $Z_i^* = Z_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \varepsilon^2)$ for a Gaussian sample $(Z_1, \ldots, Z_n)$. Data flush provides a means of evaluating the model sensitivity for various data.
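As a rough illustration of model sensitivity under the linear perturbation form, the sketch below estimates generalized degrees of freedom by Monte Carlo: each fitted value is regressed on the perturbation applied to its own observation, and the slopes are summed. The data, the perturbation size `tau`, and the use of ordinary least squares as the fitting procedure are assumptions chosen so that the answer is known to be the number of predictors.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, tau, n_rep = 50, 3, 0.5, 500  # tau: perturbation std (hypothetical)
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)


def fit_predict(X, y):
    """Fitted values of the modeling procedure (here plain least squares)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


# Perturb the response n_rep times and refit each time.
delta = tau * rng.standard_normal((n_rep, n))
fits = np.array([fit_predict(X, y + d) for d in delta])  # shape (n_rep, n)

# Sensitivity of fitted value i to its own perturbation: regression slope.
dc = delta - delta.mean(axis=0)
fc = fits - fits.mean(axis=0)
slopes = (dc * fc).sum(axis=0) / (dc**2).sum(axis=0)

gdf = slopes.sum()
print(f"estimated generalized degrees of freedom: {gdf:.2f} (p = {p})")
```

Replacing `fit_predict` with a procedure that includes model selection would give sensitivities that reflect the selection step as well.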
Data integration and personalization.
Data flush in (2.2) retains a positive rank correlation between perturbed and raw observations for the first component, as suggested by Lemma 1. This first component serves as a data identifier for data integration and personalization. In privacy protection, for instance, privatized data is released for one …

… S, where $\bar{Y}$ is the sample mean and S is the sample standard deviation. Here, we apply the data-flush inference scheme to simulate the distribution of the perturbed pivotal $T^*$ and compare it with the bootstrapped pivotal (Efron, 1992) and the exact distribution of T. To generate perturbed samples for inference, we apply (2.1) with $e_{ij}$ following a Laplace(0, 1/ε) distribution with ε = 0.01 and R being the CDF of $N(\bar{Y}, S^2)$ given Z.
Figure 1 reveals one salient aspect of data flush: It renders a nearly identical distribution of T, whereas nonparametric bootstrap differs substantially for a small sample size n = 5.
In other words, nonparametric bootstrap's approximation accuracy depends highly on the sample size n. Indeed, data flush yields an exact distribution of a pivotal as the Monte-Carlo size D → ∞. This observation agrees with the result of Theorem 1.
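The comparison can be reproduced in outline with the sketch below. It assumes that the pivotal is the usual studentized mean $T = \sqrt{n}(\bar{Y} - \mu)/S$, which is consistent with the t-distribution on n − 1 degrees of freedom used as the exact reference in Figure 1. It further assumes that the uniform score is $U_i = R(Y_i)$ and that each Monte Carlo replicate perturbs the whole sample. The data and the reduced Monte Carlo size are hypothetical.

```python
import numpy as np
from scipy.stats import kstest, norm
from scipy.stats import t as student_t


def cdf_uniform_plus_laplace(x, b):
    """CDF G of U + e with U ~ Uniform[0, 1] and e ~ Laplace(0, b)."""
    def antiderivative(t):
        return np.maximum(t, 0.0) + 0.5 * b * np.exp(-np.abs(t) / b)
    return antiderivative(x) - antiderivative(x - 1.0)


rng = np.random.default_rng(5)
n, n_mc, eps = 5, 20_000, 0.01
b = 1.0 / eps
y = rng.normal(loc=2.0, scale=3.0, size=n)  # hypothetical raw sample
ybar, s = y.mean(), y.std(ddof=1)

# Data flush: R is the CDF of N(ybar, s^2); U_i = R(Y_i) is an assumption.
u = norm.cdf(y, loc=ybar, scale=s)
e = rng.laplace(0.0, b, size=(n_mc, n))
v = np.clip(cdf_uniform_plus_laplace(u + e, b), 1e-12, 1.0 - 1e-12)
y_star = norm.ppf(v, loc=ybar, scale=s)     # one perturbed sample per row
t_flush = np.sqrt(n) * (y_star.mean(axis=1) - ybar) / y_star.std(axis=1, ddof=1)

# Nonparametric bootstrap of the same pivotal (degenerate resamples dropped).
yb = y[rng.integers(0, n, size=(n_mc, n))]
sb = yb.std(axis=1, ddof=1)
keep = sb > 1e-12
t_boot = np.sqrt(n) * (yb[keep].mean(axis=1) - ybar) / sb[keep]

exact = student_t(df=n - 1)
ks_flush = kstest(t_flush, exact.cdf).statistic
ks_boot = kstest(t_boot, exact.cdf).statistic
print(f"KS distance to t({n - 1}): data flush {ks_flush:.3f}, bootstrap {ks_boot:.3f}")
```

With ε this small the Laplace noise dominates, so the perturbed samples behave almost like independent draws from $N(\bar{Y}, S^2)$. That is why, in this sketch, the perturbed pivotal tracks the t-distribution even for n = 5.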
High-dimensional regression.
Our second example focuses on the construction of a confidence interval in linear regression on a vector of p predictors: $Y_i = \beta^\top X_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$, $i = 1, \ldots, n$, where p could be substantially larger than the sample size n, $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of regression coefficients, $X_i = (X_{i1}, \ldots, X_{ip}) \sim N(0, \Sigma)$ is a vector of predictors that are independent of the error $\varepsilon_i$, the (j, k)-th element of the covariance matrix Σ is $\rho^{|j-k|}$, and $\sigma^2$ is an unknown error variance. Our goal is to construct a confidence interval for an individual coefficient $\beta_l$ with other covariates involving model selection.
In a high-dimensional situation, one often applies the method of regularization for dimension reduction. As a result of the inherent bias from regularization, a standard method needs debiasing and uses an asymptotic distribution of the debiased LASSO estimate (Zhang & Zhang, 2014) with the $L_1$-penalty (R. Tibshirani, 1996) given a prespecified regularization parameter. Alternatively, one may invert a constrained likelihood ratio test with the $L_0$-constraint (Zhu et al., 2020). Yet, the inherent bias due to regularization persists in the finite sample regime even after debiasing.
To construct a confidence interval for parameter $\beta_l$, we apply the constrained $L_0$-norm regression (Shen et al., 2012) to select variables excluding variable $X_l$ while treating other regression parameters as a nuisance, where the truncated $L_1$-penalty function (TLP) constraint approximates the $L_0$-constraint for computation. To this end, we apply the data-flush Monte-Carlo inference method based on (2.1), generating synthetic samples to estimate the distribution of an asymptotic pivotal quantity $T = (\hat\beta_l - \beta_l)/SE(\hat\beta_l)$ (Zhu et al., 2020), where $SE(\hat\beta_l)$ is the standard error of the constrained $L_0$-norm regression (CTLP) estimate $\hat\beta_l$.
To replicate $(X_i, Y_i)_{i=1}^{n}$ for inference, we apply (2.1), where $e_{ij}$ is independently sampled from the Laplace(0, 1/ε) distribution and ε = 0.01. Then, $Y_{ij}^* = \hat\mu(X_i) + \varepsilon_{ij}^*$ satisfies ε-differential privacy for any j, where $\varepsilon_{ij}^* = R^{-1}(G(U_i + e_{ij}))$ in (2.1), $\hat\mu(X_i) = \sum_{l=1}^{p} \hat\beta_l X_{il}$ and $\hat\sigma^2$ are the fitted value and the standard estimate of $\sigma^2$ based on a holdout sample that is independent of the inference sample, R is the CDF of $N(0, \hat\sigma^2)$, and G is the CDF of $U_i + e_{ij}$ with $U_i$ following the Uniform[0, 1] distribution.
We perform simulations with the true parameters $\beta_1 = \beta_2 = \beta_3 = 1$ and $\beta_j = 0$ otherwise, with σ = 0.5 and ρ = 0.5. Then, we apply (2.1) with m = D/n and D = 10p to construct a 95% confidence interval for each $\beta_j$ based on CTLP. The results for $\beta_1$ and $\beta_4$ are representative and are presented. Specifically, we use the glmtlp package in R to compute the CTLP estimate $\hat\beta_l$ and the default $\hat\sigma^2$ there.
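The residual-perturbation recipe can be sketched in a simplified, low-dimensional setting. To keep the example self-contained, ordinary least squares stands in for the CTLP estimate computed with glmtlp, p is taken smaller than n, and the uniform score $U_i$ is taken to be the standardized residual of the inference sample under the holdout fit. These are all illustrative substitutions and assumptions rather than the procedure used for Table 2.

```python
import numpy as np
from scipy.stats import norm


def cdf_uniform_plus_laplace(x, b):
    """CDF G of U + e with U ~ Uniform[0, 1] and e ~ Laplace(0, b)."""
    def antiderivative(t):
        return np.maximum(t, 0.0) + 0.5 * b * np.exp(-np.abs(t) / b)
    return antiderivative(x) - antiderivative(x - 1.0)


def ols(X, y):
    """Least-squares coefficients, their standard errors, residual std."""
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    return coef, sigma * np.sqrt(np.diag(xtx_inv)), sigma


rng = np.random.default_rng(6)
n, p, sigma_true, rho, eps = 200, 5, 0.5, 0.5, 0.01
beta = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
idx = np.arange(p)
cov = rho ** np.abs(np.subtract.outer(idx, idx))  # (j, k) entry rho^|j-k|


def draw(size):
    X = rng.multivariate_normal(np.zeros(p), cov, size=size)
    return X, X @ beta + sigma_true * rng.standard_normal(size)


X_h, y_h = draw(n)                   # holdout sample
X, y = draw(n)                       # inference sample
coef_h, _, sigma_h = ols(X_h, y_h)   # fitted mean and error std from holdout
coef, se, _ = ols(X, y)

# Uniform scores of the inference residuals under the holdout fit (assumption).
u = norm.cdf((y - X @ coef_h) / sigma_h)

l, n_mc, b = 0, 2000, 1.0 / eps      # l indexes the coefficient of interest
t_star = np.empty(n_mc)
for j in range(n_mc):
    e = rng.laplace(0.0, b, size=n)
    v = np.clip(cdf_uniform_plus_laplace(u + e, b), 1e-12, 1.0 - 1e-12)
    y_star = X @ coef_h + sigma_h * norm.ppf(v)   # perturbed responses
    c_s, se_s, _ = ols(X, y_star)
    t_star[j] = (c_s[l] - coef_h[l]) / se_s[l]    # pivotal in the perturbed world

q_lo, q_hi = np.quantile(t_star, [0.025, 0.975])
ci = (coef[l] - q_hi * se[l], coef[l] - q_lo * se[l])
print(f"95% interval for beta_{l + 1}: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is obtained by inverting the Monte Carlo quantiles of the perturbed pivotal around the estimate from the inference sample.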
Table 2 shows that the empirical coverage probabilities for $\beta_1$ and $\beta_4$ are close to the nominal level of 95% in each scenario. The discrepancy between the empirical coverage and its target of 95% arises because the asymptotic pivotal may suffer from bias in the finite-sample situation. Overall, the data-flush Monte-Carlo inference scheme yields a credible confidence interval for a non-smooth problem involving model selection.

American Community Survey data analysis
This section applies the data-flush scheme (2.2) to the 2019 American Community Survey (ACS) Data. Note that the existing privacy literature has not thoroughly examined low-error-high-privacy differentially private methods for complex sample surveys such as the ACS (Reiter, 2019). We show that data generated by data flush is valid for statistical inference while simultaneously guaranteeing differential privacy. In particular, we demonstrate that confidence intervals constructed upon perturbed copies of raw data are close to those constructed upon perturbed copies of privatized data. In other words, the data-flush scheme can simultaneously achieve two disparate objectives: differential privacy and statistical inference.
The American Community Survey collected demographic data from 3.24 million persons nationwide, roughly 1% of the population, in 2019 (Ruggles et al., 2021). Statistical analysis of survey data has a long history. Muralidhar and Sarathy (2003) provided a theoretical basis for data perturbation with a definition of disclosure risk requirement. Raghunathan et al. (2003) and Reiter (2005) proposed to use multiple imputation to limit the disclosure risk of microdata. Woodcock and Benedetto (2009) applied a transformation to maximize data utility while minimizing incremental disclosure risk. Jiang et al. (2021) proposed a perturbation method with a masking component to preserve inferential conclusions such as confidence intervals. While most of the above methods aim at limiting the data disclosure risk, they are not designed for differential privacy and are not able to preserve distributions for most data types.
Alternatively, an investigator can apply data flush to privatize survey data like ACS data without incurring information loss when the data-flush scheme preserves the distribution of raw data.For the ACS dataset, we use (2.2) for privatization while applying the data-flush Monte-Carlo inference method to both the raw and privatized data.For an illustration, we make a pairwise comparison of two confidence intervals before and after privatization for coefficients of weighted regression.
In particular, we investigate the impact of privatization by (2.2) on the statistical accuracy of regression analysis of the total personal income on 16 covariates, including an individual's age (AGE), geographical region (REGION), the population of the residential metro/micro area (METPOP10, the logarithm of METPOP10 to be used), metropolitan status (METRO), mortgage status (MORTGAGE), sex (SEX), marital status (MARST), race (RACE), ethnicity (HISPAN), ability to speak English (SPEAKENG), health insurance coverage (HCOVANY), educational attainment (EDUCD), employment status (EMPSTAT), occupation (OCC), migration status (MIGRATE1), and veteran status (VETSTAT). For our analysis, we select individuals with a positive total pre-tax personal income from all sources during the 12 months preceding the survey. This preprocessing yields a sample of 2,389,971 individuals. See the Appendix for more specific details regarding preprocessing. The data types, as well as the number of levels for nominal variables, are summarized in Table 3. Then, we regress the logarithm of total personal income on these 16 covariates using the person weight (PERWT) as the weights for regression. A confidence interval (CI) for each regression coefficient is constructed accordingly before and after privatization.
To satisfy ε-differential privacy, we apply (2.2) with $e_{ij}$ following a Laplace(0, 17/ε) distribution to preserve the joint distribution of 16 covariates and 1 response variable across common data types. In this fashion, privatization protects each individual's information. To illustrate this point, we examine the histogram of the variable AGE before and after privatization in Figure 2. The two histograms are nearly identical, with the mean (standard deviation) being 50.80 (19.17) and 50.82 (19.17), respectively. Moreover, we randomly choose two categorical variables, namely employment status (EMPSTAT) and migration status (MIGRATE1), to examine the joint distribution before and after privatization, which are the 13th and 15th variables out of 17 variables in the sequential privatization process through (2.2). As suggested by Table 4, the data-flush scheme preserves the joint distribution quite well after privatization, particularly for the two-way associations, except for one cell (States-abroad, Non-labor) with small counts. In conclusion, the distribution preservation property of data flush ascertains the validity of downstream statistical inference while protecting data privacy.
We apply the data-flush Monte-Carlo method to construct confidence intervals for raw and privatized data. In particular, for each replication, we only perturb the linear regression residuals and follow the high-dimensional regression example in Section 3. As indicated by Figure 3, the data-flush scheme (2.2) preserves the target distribution of raw data and hence yields nearly identical confidence intervals, except for a few with shifted centers.
Privacy loss usually occurs for high-dimensional data, which is an inherent challenge for any method in differential privacy. In particular, to maintain the same accuracy level, the overall level of privacy protection for each variable tends to decay as the number of variables increases. In our situation, the overall level of privacy protection, defined by the privacy factor ε, is 1 for ε-differential privacy, which requires a stricter level of privacy protection, 1/17, for each of the 17 variables. This is equivalent to adding independent Laplace(0, 17/ε) noise to each variable, where the noise scale greatly exceeds the ranges of many variables in the ACS data, especially for binary dummy variables.
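The arithmetic behind this budget split is short enough to spell out. The sketch below assumes the overall privacy factor is divided equally across the 17 variables and uses the fact that a Laplace(0, b) variable has standard deviation $\sqrt{2}\,b$. The variable names are ours.

```python
import math

eps_total = 1.0   # overall privacy factor
n_vars = 17       # 16 covariates plus 1 response

eps_per_var = eps_total / n_vars          # budget available to each variable
laplace_scale = 1.0 / eps_per_var         # equals 17 / eps_total
noise_sd = math.sqrt(2.0) * laplace_scale # standard deviation of Laplace noise

binary_range = 1.0                        # range of a 0/1 dummy variable
print(f"per-variable privacy factor: {eps_per_var:.4f}")
print(f"Laplace scale: {laplace_scale:.1f}, noise std: {noise_sd:.2f}")
print(f"noise std relative to a binary range: {noise_sd / binary_range:.1f}x")
```

A noise standard deviation of about 24 against a range of 1 shows why purely additive noise would swamp a binary dummy variable at this privacy level.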

Discussion
Data perturbation has great potential as an effective tool for replicating a sample, with applications in data security, statistical inference, and data integration, among others. The fundamental principle described in this article, distribution preservation for data perturbation, allows users to design data perturbation schemes, such as data flush, to satisfy task-specific requirements, as we showcase for statistical inference with differentially private data in Section 4. On this ground, synthetic data generated by such a scheme yields statistically valid analysis and high predictive accuracy of a machine learning task.
Several future directions of research include a more flexible model-based estimation (e.g., one including both parametric and empirical components) for high-dimensional target distributions and a compatible data perturbation scheme, as well as generalizations to independent but non-identically distributed data, time-series data, and unstructured data.

Figure 2. Histogram of the AGE variable in the ACS data before and after privatization.

Table 1. Private Poisson regression with a privacy factor ε = 1 using raw data, data privatized by data flush in (2.2), and data privatized by GEM (Liu, Vietri, & Wu, 2021). Kullback-Leibler divergence (KL) and root mean square error (RMSE) for regression coefficients (with the standard error in parentheses), together with privatization time (Time, in seconds), are presented based on 200 replications. Here σ is the standard deviation of each covariate before discretization (a step required by GEM), and NA indicates that an algorithm fails to converge within two days.

Table 4. Joint distribution between employment status (EMPSTAT) and migration status (MIGRATE1) before and after privatization, where each cell in the contingency table indicates the number of individuals in the release sample before (after) privatization. For MIGRATE1, "House", "State", and "States-Abroad" indicate staying in the same house, moving within a state, and moving between states or abroad; for EMPSTAT, "Employed", "NA/Unemployed", and "Non-labor" mean that an individual is employed, unemployed or not applicable, and not in the labor force, respectively.

Figure 1. Illustration of the exact distribution of the pivotal for three sample sizes n = 5, 10, 20 based on simulated data. The pivotal's densities for data flush with a Monte Carlo size $10^5$, nonparametric bootstrap with a bootstrap size $10^5$, and the t-distribution on n − 1 degrees of freedom are represented by solid, dotted, and dashed curves, respectively.

Figure 3. Confidence intervals of regression coefficients based on raw data and privatized data, represented by gray and red lines and constructed using the data-flush scheme in Section 3. Regressors from the top to the bottom are the intercept (shifted to the left by 8 units for better visualization), AGE, REGION (8 dummy variables), METPOP10, METRO (2 dummy variables), MORTGAGE (2 dummy variables), SEX, MARST (5 dummy variables), RACE (5 dummy variables), HISPAN, SPEAKENG (2 dummy variables), HCOVANY, EDUCD (6 dummy variables), EMPSTAT (2 dummy variables), OCC (12 dummy variables), MIGRATE1 (2 dummy variables), and VETSTAT. The confidence intervals based on raw data are comparable with those after privatization in terms of the signs of interval centers and lengths.

Table 3. Summary statistics for variables used in the ACS analysis, including variable names (Name), types (Type), the number of levels for nominal variables (# Level), as well as the mean (Mean) and standard deviation (Standard deviation). Here NA means "Not applicable".
Harv Data Sci Rev. Author manuscript; available in PMC 2023 March 09.