Differentially Private Linear Regression With Linked Data
·
Abstract
There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream preprocessing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this article, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy–accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.
Keywords: differential privacy, record linkage, data integration, privacy-preserving record linkage, gradient descent
Media Summary
Differential privacy is a mathematical framework for ensuring the privacy of individuals in data sets. It mitigates the privacy risk of disclosing sensitive information about individuals within the data set during data analysis. Under such a framework, we are interested in finding the relationship between two variables (via statistical regression) after they are linked from two data sources with uncertainties. A preprocessing procedure of linking data sets is called record linkage, and the uncertainties should be taken into account in the downstream analysis. In the article, we propose two algorithms that satisfy differential privacy for regression estimation problems with linked data. The theoretical results regarding privacy guarantees and statistical accuracy are provided. We demonstrate the performance of the proposed algorithms through simulations and an application.
1. Introduction
Data for the same group of entities are often scattered across different resources, lacking unique identifiers for perfect linkage. To conduct statistical modeling or inference on the integrated information, it is necessary to probabilistically link multiple data sets by comparing the common quasi-identifiers (e.g., names, gender, address) as a preprocessing step. Such a procedure is called record linkage (RL), also known as entity resolution, or data matching (Christen, 2012), which is an essential component of data integration in big data analytics (Dong & Srivastava, 2015). Thanks to its wide application in many disciplines such as public health and official statistics, record linkage has been studied for decades. Earlier pioneering works include Fellegi and Sunter (1969), Jaro (1989), Newcombe et al. (1959). In addition, record linkage is frequently used in current practice. The U.S. Census Bureau has a long tradition using record linkage methodology for multiple endeavors. A current prominent example is the Decennial Census (U.S. Census Bureau, 2022). In this context, record linkage involves using administrative records and other data sources to improve data quality, with efforts underway to construct a comprehensive ‘reference database’ including individuals from multiple administrative records. A recent review paper, Binette and Steorts (2022), provided a comprehensive summary of record linkage. Broadly speaking, there are two perspectives regarding record linkage (Salvati et al., 2021): (1) the primary viewpoint concerns how to link the records; (2) the secondary perspective is focused on how to propagate the uncertainty to the downstream statistical learning tasks after the linkage has been determined. Our focus in this article will adopt the second of these two perspectives.
Closely related to record linkage is data privacy. In the area of privacy-preserving record linkage (PPRL), two or more private data sets owned by different organizations are linked without revealing the data to one another (Christen et al., 2020; Hall & Fienberg, 2010). The outcome of PPRL is the information regarding which pairs or sets are matched. PPRL, in turn, is associated with secure multiparty computation (SMPC) in that SMPC techniques are commonly used to solve PPRL problems (He et al., 2017; Kuzu et al., 2013; Rao et al., 2019). PPRL to date only engages in the private linkage process from a primary perspective, without concerning how the linkage uncertainties would impact the downstream analysis. On the contrary, the secondary perspective is to modify the statistical tools to account for the linkage uncertainty. Our goal is to incorporate privacy into the secondary perspective of record linkage, which is different from, yet complementary to PPRL or SMPC.
Privacy concerns have, if anything, become significantly more exacerbated with the emergence of individual-level big data. Releasing information about a sensitive data set is subject to a variety of privacy attacks (Dwork et al., 2017). Therefore, there has been a growing demand for establishing robust privacy-preserving methodologies for modern statistics and machine learning. A mathematical framework proposed by Dwork et al. (2006), differential privacy (DP), is now considered the gold standard for rigorous privacy protection and has made its way to broad application in industry (Apple Differential Privacy Team, 2017; Bind, 2020; Google, 2021) and the public sector (U.S. Census Bureau, 2021). The literature on differential privacy has been flourishing in recent years and the interface of differential privacy and statistics has started to draw increasing attention from the statistics community.
Recent work on differential privacy focuses primarily on individual statistical and machine learning tasks, with nontrivial upstream preprocessing, such as record linkage, typically not incorporated. In this article, we consider the linear regression problem, that is,
y=Xβ+e,e∼N(0,σ2In),(1.1)
but where X and y are observed in two separate data sets. As a result, rather than having X and y in hand, we are instead provided with a pair X and z. Here z is a permutation of y resulting from record linkage performed by an external entity, who also supplies a minimum amount of information about the linkage accuracy. In the regression procedure, we take into account the linkage uncertainty as well as offer differential privacy guarantees. As shown in Figure 1, which depicts the pipeline of the problem we consider, we assume that an external analyst conducts record linkage a priori. From there, we aim to devise a private estimator for the regression coefficients of ultimate interest with the help of differential privacy.
Specifically, we propose two algorithms for linear regression after record linkage to meet differential privacy: (1) post-RL noisy gradient descent (NGD), and (2) post-RL sufficient statistics perturbation (SSP). Our work builds on the seminal work by Lahiri and Larsen (2005) where an estimator is proposed for linear regression with linked data in a non-privacy-aware setting. We construct a private estimator, β^priv, by deploying differential privacy tools to achieve privacy protections. To the best of our knowledge, our work is the first one in the literature to consider a statistical model after record linkage in a privacy-aware setting.
The two proposed algorithms also extend the noisy gradient method (Bassily et al., 2014) and the “Analyze Gauss” algorithm (Dwork et al., 2014), which are applied to linear regression, to additionally handle the presence of linkage errors. Prior works (Alabi et al., 2020; Bernstein & Sheldon, 2019; Cai et al., 2021; Sheffet, 2017; Y.-X. Wang, 2018) on differentially private linear regression do not consider possible record linkage preprocessing. If the data are linked beforehand, directly applying their algorithms to the imperfectly linked data is not ideal. It is well known that overlooking the linkage errors leads to substantial bias even with a high-linkage accuracy (Neter et al., 1965; Scheuren & Winkler, 1993). Figure 2 showcases a toy example of record linkage, where mismatches, if treated as true, change the sign of the slope estimate. Our illustrative application later in the article confirms this, where around 90% of the records are correctly linked, and the estimators ignoring linkage errors end up with large biases.
Accompanying the estimators resulting from our algorithms, we provide mean-squared error bounds under typical regularity assumptions and record linkage schemes. When no linkage errors are present (i.e., a special case in our scenario), our result in Theorem 4.4 improves upon the noisy gradient method proposed in Cai et al. (2021) by using zero-concentrated differential privacy (Bun & Steinke, 2016) to enable tighter bounds on privacy cost (see Lemma 2.3). Additionally, we have presented (approximate) theoretical variances for β^priv resulting from both proposed algorithms. There appear to be very few other works that have addressed the issue of uncertainty. Two that we are aware of are Alabi (2022), who provided confidence bounds for the univariate case, and Sheffet (2017), who provided confidence intervals dependent on differential privacy noise. Our work focuses on the multivariate case and appears to be the first to directly work on exact variances rather than relying on bounds.
The remainder of this article is organized as follows. Section 2 provides preliminaries on linear regression with linked data and differential privacy. We propose our two algorithms in Section 3 and present the relevant theoretical results in Section 4. In Section 5, we conduct a series of simulation studies and an application to synthetic data. Section 6 concludes and discusses future work. Complete proofs of all theorems can be found in the supplementary materials.
2. Preliminaries
In this section, we review the background results of linear regression after record linkage upon which we build our work, and fundamental concepts from differential privacy. Related work on linear regression with linked data and record linkage with privacy awareness are discussed.
2.1. Linear Regression with Record Linkage
Let (X,ΦX) and (y,Φy) be two data sets that refer to the same group of n entities, with unknown one-to-one correspondence. The quasi-identifiers ΦX and Φy are used to perform the linkage procedure. Let (X,z) be the linked data where z is a permutation of y. Consider the following model for z:
P(zi=yj)=qij,i,j=1,…,n,(2.1)
then ∑j=1nqij=1 for all i and ∑i=1nqij=1 for all j. Thus, qii is the probability of the ith record being linked correctly. Let Q=(qij), which we call the matching probability matrix (MPM), a doubly stochastic matrix. The matrix Q can be estimated, for example, through bootstrapping (Chipperfield, 2020; Chipperfield & Chambers, 2015). In some cases, estimating Q can require inference on only a single parameter (e.g., in the exchangeable linkage error (ELE) model described in Section 2.1.1).
For the fixed-design homoskedastic linear model (1.1), when inference is done after record linkage based on (X,z), Lahiri and Larsen (2005) proposed an unbiased estimator
β^RL=(W⊤W)−1W⊤z,(2.2)
where W=QX. Let wi be the i-th row vector of W, then wi=∑j=1nqijxj. Note that E(zi)=wi⊤β, where the expectation is taken over both linkage uncertainties and y. Transforming X into W offers bias correction for regression estimation after record linkage.
In addition, the variance of β^RL is given by
ΣRL=defVar(β^RL)=(W⊤W)−1W⊤ΣzW(W⊤W)−1,(2.3)
where Σz=defVar(z). Lahiri and Larsen (2005) provide the following characterization of the first two moments of z.
Lemma 2.1. (Theorem A.1, Lahiri and Larsen, 2005). Under the model described by (1.1) and (2.1), we have for i,j=1,…,n
E(zi)=wi⊤β;
Var(zi)=σ2+β⊤Aiβ with Ai=∑j=1nqij(xj−wi)(xj−wi)⊤;
Cov(zi,zj)=β⊤Aijβ with Aij=∑u=1n∑v=unqiuqjv(xi−wu)(xj−wv)⊤.
Note that Σz involves the true coefficients β and Σz=σ2Id+h(β,Q,X) where h(β,Q,X) is a function of β,Q,X as elaborated in Lemma 1. Compared to the covariance of y, Σz has an additional component h(β,Q,X) due to the uncertainty of record linkage.
2.1.1. Structural Schemes of MPM
The matching probability matrix (MPM) Q is generally assumed to have a simple structure. Two schemes used commonly in the literature are as follows.
Blocking Scheme. It is assumed that the MPM is a block diagonal matrix, which means the true matches only happen within blocks. Blocking significantly reduces the number of pairs for comparison and allows scalable record linkage. This scheme is used in almost all real-world applications, and different methods for blocking have been developed (Christen, 2012; Christophides et al., 2020; Steorts et al., 2014).
Exchangeable Linkage Errors (ELE) Model. The ELE model (Chambers, 2009) assumes homogeneous linkage accuracy and errors:
P(correct linkage)=qii=γ,P(incorrect linkage)=qij=n−11−γ for i=j.(2.4)
The ELE model has been adopted in recent works, such as Salvati et al. (2021) and Chambers et al. (2023), for various estimation problems. Even though (2.4) may oversimplify the reality, it is a representative model for a secondary analyst who has minimum information about the linkage quality. When blocking is used, the homogeneous linkage accuracy assumption is imposed within individual blocks. In other words, it still allows heterogeneous linkage accuracy between blocks.
2.2. Differential Privacy
Let X be some data space, and D,D′∈Xn be two neighboring data sets of size n which only differ in one record. Such a relation is denoted by D∼D′.
Definition 1 ((ϵ,δ)-DP (Dwork & Roth, 2014)). For ϵ>0,δ≥0, a randomized algorithm A: Xn→R is (ϵ,δ)-differentially private if, for all D∼D′∈Xn and any O⊆R,
P(A(D)∈O)≤eϵ⋅P(A(D′)∈O)+δ.(2.5)
The expression (2.5) controls the distance between the output distributions on two neighboring data sets through the privacy budget ϵ and δ. Intuitively, differential privacy ensures that D is not distinguishable from D′ based on the outputs. Thus, ϵ should be small enough for the privacy level to be meaningful. Typically, ϵ∈(10−3,10) and δ=o(1/n).
Differential privacy enjoys the following properties that facilitate the construction of differentially private algorithms.
Proposition 2.1 (Basic composition [Dwork & Roth, 2014]). If f1 is (ϵ1,δ1)-DP and f2 is (ϵ2,δ2)-DP, then f:=(f1,f2) is (ϵ1+ϵ2,δ1+δ2)-DP.
Proposition 2.2 (Postprocessing [Dwork & Roth, 2014]). If f is (ϵ,δ)-DP, for any deterministic mapping g that takes f(D) as an input, then g(f(D)) is (ϵ,δ)-DP.
Generally, a differentially private algorithm is constructed by adding random noise from a certain structured distribution, such as the Laplace or Gaussian distributions. A notion central to the amount of noise we add is the sensitivity of the estimation function we desire to release privately.
Definition 2 (ℓ2-sensitivity). Let f:Xn→Rd be an algorithm. The ℓ2-sensitivity of f is defined as
Δf=D∼D′∈Xnmax∥f(D)−f(D′)∥2.(2.6)
The sensitivity of a function characterizes how much the output would change if one record in the data set changes. To achieve (ϵ,δ)-DP, the amount of noise we need depends on both the budget and the sensitivity. The Gaussian mechanism is a canonical example that will be employed herein, which does just that.
Lemma 2.2 (Gaussian mechanism [Dwork & Roth, 2014]). Let 0<ϵ<1 and δ>0. For an algorithm f on the data set D, the Gaussian mechanism A(⋅) defined as
A(D):=f(D)+u,(2.7)
where u∼N(0,2ln(1.25/δ)(Δf/ϵ)2), is (ϵ,δ)-DP.
Combining the basic composition rule and the Gaussian mechanism, for a sequence of functions (f1,f2,…,fT), let
ut∼N(0,ϵ22T2Δt2ln(1.25T/δ)),
where Δt is the ℓ2-sensitivity of ft. Then, A:=(f1+u1,f2+u2,…,ft+uT) satisfies (ϵ,δ)-DP. However, as T increases, this construction tends to add more noise than necessary due to the loose composition. Instead, we could utilize zCDP, another variant of DP, to achieve tighter composition for (ϵ,δ)-DP. The following Lemma essentially captures the results from Bun and Steinke (2016), formulated for our purposes.
Lemma 3 (Better composition for (ϵ,δ)-DP via zCDP). Let ϵ>0,δ>0. For a sequence of functions (f1,f2,…,fT), let
ut∼N(0,2ρTΔt2),(2.8)
with ρ:=ϵ+2ln(1/δ)−2(ϵ+ln(1/δ))ln(1/δ). Then, the randomized algorithm A:=(f1+u1,f2+u2,…,fT+uT) satisfies (ϵ,δ)-DP. If ϵ≤2+28ln(1/δ), it suffices to have
ut∼N(0,ϵ24TΔt2ln(1/δ)).(2.9)
Please refer to the supplementary materials for details. Since, in most practical budget settings, we have ϵ≤2+28ln(1/δ), we will apply (2.9) for composition and analysis in the rest of the article, acknowledging that (2.8) is valid for all parameter ranges.
In Section 3, we shall employ Lemmas 2.2 and 2.3 in devising two distinct algorithms for linear regression after record linkage.
2.3. Related Work
Linear regression with linked data is a fundamental statistical task that has been explored in various articles. Scheuren and Winkler (1993) initially considered the linkage model (2.1) for linear regression and proposed an estimator that is not generally unbiased. Later, Lahiri and Larsen (2005) introduced an exactly unbiased ordinary least squares (OLS)–like estimator given in (2.2) with an expression for the variance, which outperformed the approach by Scheuren and Winkler (1993). Besides, Chambers (2009) and Zhang and Tuoto (2021) offered a few other estimators. According to their simulation studies, some of the estimators provided performance that was at most similar, but not noticeably better, compared to the one proposed by Lahiri and Larsen (2005). Yet, Zhang and Tuoto (2021) relaxed the condition by not assuming that the probability of correct linkage, qii in the model (2.1), can be obtained or estimated. For more extensive reviews of this literature, Z. Wang et al. (2022) gave an account of the recent development of various methods on regression analysis with linked data sets. Chambers et al. (2023) reviewed current research on robust regression of linked data.
On the other hand, there is ongoing research on privacy-preserving record linkage in the field of computer science. PPRL aims to privately link multiple sensitive data sets held by different organizations when they are unwilling or not permitted to share their data with external parties due to privacy and confidentiality concerns. To achieve privacy protection, techniques such as SMPC and DP are combined with machine learning and deep learning methods for conducting PPRL (Christen et al., 2020; Gkoulalas-Divanis et al., 2021; Ranbaduge et al., 2022). PPRL primarily concerns data leakage during the linkage process and produces a linked data set that can be used for further analysis, yet most applications treat the linked data as if there were no linkage errors. Neither the uncertainty propagation nor the private release of the downstream analysis is considered within the scope of PPRL.
Note that there are several articles on privacy-preserving analysis on vertically partitioned databases. In these databases, the attributes are distributed among multiple parties, but common unique identifiers exist to facilitate data linkage across the different parties. Unlike probabilistic record linkage, vertically partitioned databases do not involve linkage errors. Du et al. (2004), Gascón et al. (2017), Hall et al., (2011), and Sanil et al. (2004) discussed the implementations of privacy-preserving linear regression protocols that prevent data disclosure across organizations, whereas Dwork and Nissim (2004) considered data mining from the perspective of the private release of statistical querying in a spirit similar to our work.
3. Differentially Private Algorithms
The unbiased and simply structured estimator provided in (2.2) with a known closed-form variance makes it a suitable prototype to construct our private estimators. We introduce two differentially private algorithms in the following, based on (1) noisy gradient descent and (2) sufficient statistics perturbation. As the names suggest, we mitigate privacy risk by perturbing either the gradient or sufficient statistics during the computation of the linear model. Hereafter, if not specified otherwise, ∥⋅∥ denotes the 2-norm.
3.1. Post-RL Noisy Gradient Descent
Gradient descent methods are ubiquitous in scientific computing for numerous optimization problems. Within the framework of differential privacy, Bassily et al. (2014) provided a noisy variant of the classic gradient descent algorithm. It was later adapted by Cai et al. (2021) to solve the classic linear regression problem with faster convergence. Leveraging the work by Bassily et al. (2014) and Cai et al. (2021), we tailor the noisy gradient method for the post-RL linear regression model for (X,z) based on (1.1) and (2.1).
Let Ln(β)=def2n1(z−Wβ)⊤(z−Wβ) be the loss function, where recall W=QX. The minimizer of Ln(β) is the nonprivate RL estimator proposed by Lahiri and Larsen (2005). Let ΠR(r) denote the projection of r∈Rs onto the ℓ2 ball {r∈Rs:∥r∥≤R}. The post-RL NGD algorithm is defined as follows.
Input: Linked data set (X,z) and matching probability matrix Q, privacy budget (ϵ,δ), noise scale factor B, step size η, number of iterations T, truncation level R, feasibility C, initial value β0. 1: Let W=QX. 2: for t=0 to T−1do 3: Generate ut∼N(0,ω2Id) where ω=nϵ2ηBTln(1/δ). 4: Compute βt+1=ΠC(βt−nη∑i=1n(wi⊤βt−ΠR(zi))wi+ut).(3.1) 5: end for Output: β^priv=βT.
Algorithm 1 is a modified version of the projected gradient descent that incorporates (1) post-RL transformation of the design matrix, (2) addition of noise ut at each gradient step, and (3) use of projection ΠR(⋅) on the response variable. The regular parameters, including η, T and C for the projected gradient method, are specified in Theorem 4.4 for the discussion of the accuracy of β^priv. The injection of noise follows Lemma 2.3. The scale of the Gaussian noise ut at step t depends on the privacy budget (ϵ,δ), and the noise scale factor B associated with the sensitivity in the update function (3.1). The purpose of the projection on z is to bound the sensitivity of the gradient. With a proper choice of R that scales up with lnn (specified in Section 4), the projection does not affect the accuracy of the final estimator with high probability.
The major challenge lies in calculating the sensitivity. In the non-RL least squares regression, two neighboring data sets D=(X,y) and D=(X′,y′) differ in a single row, making it straightforward to derive the sensitivity of the gradient of Ln(β)=2n1(y−Xβ)⊤(y−Xβ). Here, in the context of post-RL analysis, we consider two neighboring data sets containing both linking variables and regression variables, denoted as D=(X,ΦX,y,Φy) and D′=(X′,ΦX′,y′,Φy′), which differ in the record of one individual. The change in one row of the quasi-identifiers ΦX and Φy may affect more than one row of the matching probability matrix Q. As a result, the entries of the transformed design matrix W=QX subject to change are not limited to one row as in the non-RL case. Consequently, determining the sensitivity of the gradient of Ln(β)=2n1(z−Wβ)⊤(z−Wβ) becomes nontrivial. This challenge distinguishes our work from Cai et al. (2021). However, we will demonstrate in Section 4 that, under a condition on the structure of Q, the sensitivity can be tracked.
3.2. Post-RL Sufficient Statistics Perturbation
Noise can be injected into the process besides the gradient computation. Since the estimator interacts with the data through its (joint) sufficient statistics, an efficient way is to perturb the sufficient statistics to protect the data. Such a technique, sufficient statistics perturbation, has been used in previous works such as Foulds et al. (2016), Vu and Slavkovic (2009), and Y.-X. Wang (2018). For the nonprivate OLS estimator β^OLS=(X⊤X)−1Xy, to perturb the joint sufficient statistics (X⊤X,Xy), it suffices to add noise to A⊤A where A=(X∣y) is the augmented matrix. Dwork et al. (2014) offered an algorithm, “Analyze Gauss,” to privately release A⊤A. It was later utilized by Sheffet (2017) for private linear regression, primarily perturbing the sufficient statistics.
In our work, we adapt the “Analyze Gauss” algorithm to linear regression after record linkage, as shown in Algorithm 2. The noise scale factor B is the sensitivity of A⊤A=def(W⊤Wz⊤WW⊤zz⊤z) which is specified in Section 4. The gram matrix A⊤A exhibits properties that facilitate the computation of its sensitivity. Algorithm 2 illustrates how incorporating the joint sufficient statistics in a comprehensive form facilitates the deployment of differential privacy.
Input: Linked data set (X,z) and matching probability matrix Q, privacy budget (ϵ,δ), noise scale factor B, truncation level R. 1: Let W=QX. 2: Generate a d×d symmetric Gaussian random matrix U whose upper triangle entries (including the diagonal) are sampled i.i.d. from N(0,ω2) where ω=ϵB2ln(1.25/δ). 3: Generate a d-dimensional Gaussian random vector u whose entries are sampled i.i.d. from N(0,ω2). 4: if W⊤W+U is computationally singular then 5: Repeat steps 2∼3. 6: end if Output:β^priv=(W⊤W+U)−1(W⊤z∗+u) where z∗=(ΠR(z1),…,ΠR(zn))⊤.
Remark 3.1. In step 4, by postprocessing, checking for singularity of W⊤W+U consumes no extra privacy budget. In fact, the probability of W⊤W+U being singular decreases exponentially as the sample size increases.
An alternative approach to implementing the SSP method is to add random noise separately to each sufficient statistic. In this approach, the total privacy budget should be divided between X⊤X and Xy for the estimation of linear regression, as proposed by Y.-X. Wang (2018). However, treating the joint statistics as a whole is more economical in terms of budgeting in general. Lin et al. (2024) showed through comparison that splitting the total budget among the components results in introducing larger noise on average. Although adding noise individually to the components of interest allows for the private release of each quantity, it is not part of the goal of the estimation.
4. Theoretical Results
In this section, we provide the theoretical results of the two algorithms introduced in Section 3. The results are threefold: (1) differential privacy guarantees, (2) finite-sample error bounds, and (3) variances of the private estimators. We present each of these along with a discussion of the corresponding conditions as they relate to the main variables in our record linkage model. All proofs for these results can be found in the supplementary materials.
4.1. Privacy Guarantees
The algorithms are designed to achieve certain privacy guarantees, given the corresponding sensitivity, for the post-RL case:
Theorem 4 (Privacy Guarantees). Assume the following boundedness conditions hold:
(A1) There is a constant cx<∞ such that ∥x∥2≤cx.
(A2) Let Q and Q′ be the matching probability matrices (MPMs) resulting from the neighboring datasets D and D′ and let Q∼Q′ denote such a relation. We assume that supQ∼Q′∥Q−Q′∥1≤M for some constant M<∞, where ∥⋅∥1 is the entry-wise 1-norm.
Given the linked data (X,z) and the matching probability matrix Q for the regression problem in (1.1), under Assumptions (A1) and (A2), it follows that
(1) Algorithm 1 satisfies (ϵ,δ)-differential privacy with B=Rcx(M+4)+2Ccx2(M+2),(4.1)
(2) Algorithm 2 satisfies (ϵ,δ)-differential privacy with B=Rcx(M+4)+max{2cx2(M+2),2R2}.(4.2)
Essentially, we assume that the data domain is bounded, which is critical for deriving a finite sensitivity of the target function on the data. (A1) is a standard assumption for a bounded designX. For the linking variables that are generally categorical, there are no analogous definitions of ‘norm’ for numerical vectors. Instead, (A2) is imposed on the MPM since it summarizes all the information of the linking variables in the linkage model we consider. Specifically, we assume that two MPMs produced by two neighboring data sets do not differ much in terms of the entry-wise 1-norm. This assumption characterizes a bounded linkage model.
The rationale of (A2) is supported by typical schemes imposed on the structures of MPM in practice, as reviewed in Section 2.1.1. For example, with the blocking scheme, the size of each block is manageably small (O(1)). When one record is altered, the fluctuation of the MPM is limited to at most two blocks. Additionally, with the ELE model (2.4), as long as the changes to a single record only affect a finite number of records, the linkage accuracy γ changes at most O(1/n). Therefore, we have supQ∼Q′∥Q−Q′∥1=O(1). In general, a robust record linkage approach should not produce two considerably different MPMs from two neighboring data sets. Therefore, it is realistic to assume a bounded linkage model.
The proofs of Theorem 4.1 revolve around calculating the sensitivity of the target function in each algorithm. Besides the upper bounds cx and M discussed above, the sensitivity also depends on the truncation level R on the response. Truncation is commonly used in DP algorithm designs when there are no priori bounds on the relevant quantities (e.g., Abadi et al. [2016]). In Section 4.2, we provide a specific choice of R and present an accuracy statement with high probability.
4.2. Finite-Sample Error Bounds
We study the accuracy of the proposed estimators by deriving the finite-sample error bounds. In the following, we introduce two more assumptions in addition to (A1) and (A2):
(A3) The true parameter β satisfies ∥β∥2≤c0 for some constant 0<c0<∞.
(A4) The minimum and maximum eigenvalues of W⊤W/n satisfy
0<L1<dλmin(nW⊤W)≤dλmax(nW⊤W)<L(4.3)
for some constant 1<L<∞.
Assumption (A4) implies the smoothness and strong convexity of the loss function Ln(β)=2n1(z−Wβ)⊤(z−Wβ), which allows for a fast convergence rate for the gradient descent method in Algorithm 1. On the other hand, for Algorithm 2, note that the term (W⊤W)−1 is a component of sufficient statistics. Assumption (A4) offers a bound on the norm of (W⊤W)−1, which helps derive the error bound of β^priv. Let Assumption (A4’) be (A4) with W replaced by X and the constant L replaced by L′. The larger of L and L′ can be chosen as the constant to satisfy both (A4) and (A4’). Therefore, for convenience, we consider (A4) and (A4’) to be the same assumption. We first obtain the accuracy of the nonprivate estimators, for comparison purposes.
Lemma 4.2. Let β^OLS=βargmin(y−Xβ)⊤(y−Xβ) be the OLS estimator. Then, under (A4), it follows that E∥β^OLS−β∥2=σ2tr(X⊤X)−1=Θ(nσ2d2).
Lemma 4.3. Let β^RL=βargmin(z−Wβ)⊤(z−Wβ) be the nonprivate record linkage estimator, and ΣRL be the covariance matrix of β^RL. Then,
E∥β^RL−β∥2=tr(ΣRL),(4.4)
where ΣRL=(W⊤W)−1W⊤ΣzW(W⊤W)−1.
As a special case, when the linkage is perfect (i.e., Q is an identity matrix), the expected error of β^RL in (4.4) takes the reduced form σ2tr(X⊤X)−1 which is exactly the lower bound obtained by β^OLS. Then, by Lemma 4.2, we know that E∥β^RL−β∥2 is of order at least nσ2d2 under (A4). From a secondary perspective regarding record linkage, it is beyond our scope to study how tr(ΣRL) behaves in general.
For the two proposed algorithms, we present upper bounds of the excess squared error of the private estimators, namely, ∥β^priv−β^RL∥2.
Theorem 4.4 (Post-RL NGD). Given the linked data (X,z) and the matching probability matrix Q for the regression problem in (1.1), set the parameters of Algorithm 1 as follows:
step size η=d/L, number of iterations T=⌈L2ln(c02n)⌉, feasibility C=c0, initialization β0=0;
truncation level R=σ2lnn;
noise scale factor B=Rcx(M+4)+2c0cx2(M+2);
Under Assumptions (A1)-(A4), given δ=o(1/n), with probability at least 1−c1e−c2lnn−e−c3d where c1,c2,c3 are constants (see the proof), it follows that
In both algorithms, the response is projected with a level R=σ2lnn where σ2 is the homoskedastic variance of the random error in linear model (1.1). Let E={ΠR(zi)=zi,∀i∈[n]}, then E is a high-probability event. The error bound is analyzed under E, thus we obtain a statement with high probability.
In the NGD method, the bound consists of two parts on the right-hand side in (4.5). The first error term 1/n results from the convergence rate of gradient descent after T iterations. The second error term is due to the addition of Gaussian noise for privacy and thus involves ϵ,δ. It is worth noting that the choice in theory T=⌈L2ln(c02n)⌉ is, to some extent, conservative to ensure the first error term is O(1/n), which is the same order as E∥β^OLS−β∥2. However, more iterations give rise to larger random noise being added to gradient updates due to a smaller privacy budget per iteration. In practice, a smaller number of iterations may be favored for the tradeoff (see the experiment in Section 5.2), especially when n is not sufficiently large.
For the SSP algorithm, the convergence rate in (4.7) depends on similar variables as in the NGD algorithm. The major difference is that it is controlled by σ4 instead of σ2 due to the sensitivity of the gram matrix A⊤A defined in Section 3.2. However, the SSP method has a faster convergence rate when n is sufficiently large. As a result, the SSP estimator is more susceptible to a large variance of the random error in the response variable, whereas the NGD method is more robust. As we shall see in Section 5, the performance of the two algorithms is different under various scenarios.
Putting together Lemma 4.3 and Theorems 4.4 and 4.5, we obtain a high-probability error bound for each algorithm as follows.
Corollary 4.1. Under the regularity conditions (A1)-(A4),
Although a few works (Alabi, 2022; Sarathy & Vadhan, 2024; Sheffet, 2017) have addressed uncertainty of DP estimators through confidence bounds and intervals, and similarly through the release of private estimates of variance, the exact variance of DP estimators is rarely determined in most cases. Recent work, such as Lin et al. (2024), has explored the exact variance of the private estimators for population proportions that have fairly simple structures. The main barrier to the inspection of variance is that if the noise is injected into the intermediate steps of the estimation process other than the output, then it is difficult to track the variability that noise introduces to the output estimator due to the intricate nature of the algorithm.
The NGD and SSP algorithms are two examples where noise is added in the middle of the estimation process. The operations like function composition and taking the inverse complicate the inspection of the variance of the output estimator β^priv. To address this issue, we investigate the variance of β^priv for the two algorithms by studying the variances of two proxy estimators. The theoretical variances of the proxy estimators can be used to approximate those of β^priv.
Theorem 4.6 (Variance for Post-RL NGD). In Algorithm 1, if we consider the estimator without projections
where Id is the identity matrix of size d, A=defnηW⊤W, B=defnηW, and ω2 is the variance of ut.
Remark 4.3. In the nonprivate case where ω2=0, let T→∞, in which case
Σ→A−1B⊤ΣzBA−1=(W⊤W)−1W⊤ΣzW(W⊤W)−1=ΣRL,
which is exactly the variance of β^RL given in (2.3).
The estimator in Algorithm 1 is a projected variant of (4.11). The use of projection with level C on βt+1 in (3.1) impedes the exact analysis of variance for β^priv. Instead, we provide the variance in (4.12) for the nonprojected estimator as a conservative variance for β^priv. The level of projection, the scale of noise, and the number of iterations together determine how conservative it is. From Remark 4.3, we know that as T increases, the first term in the RHS of (4.12) is getting close to ΣRL. The second term, ω2t=1∑T(Id−A)2t−2, then summarizes the cumulative variability resulting from adding random noise at each iteration. Note that this term does not converge by simply increasing T, due to the fact that a smaller budget leads to larger noise at each iteration.
Theorem 4.7 (Variance for Post-RL SSP). For Algorithm 2, let β^′=β^RL+(W⊤W)−1u−(W⊤W)−1⋅U(β^RL+(W⊤W)−1u), then β^priv−β^′→p0 as n→∞. The variance of β^′ is given by
Σ=ΣRL+ω2(W⊤W)−1(Id+Σ0+Σ1+Σ2)(W⊤W)−1,(4.13)
where ΣRL=Cov(β^RL) and the entries of Σ0, Σ1 and Σ2 are given by
(Σ0)kk=∑i=1dβi2 for k=1,...,d; (Σ0)kl=βkβl for k=l.
(Σ1)kk=∑i=1dΣiiRL for k=1,...,d; (Σ1)kl=ΣklRL for k=l.
(Σ2)kk=∑i=1dΣii′ for k=1,...,d; (Σ2)kl=Σkl′ for k=l, where Σ′=defω2(W⊤W)−2.
Remark 4.4. As shown in the proof of 4.7 (see the supplemental), the proxy estimator β^′ is a first-order approximation for β^priv using Taylor series for the term (I+U(W⊤W)−1)−1, which appears in the decomposition of β^priv.
The variance of β^′ also consists of two parts: the variance of the nonprivate estimator β^RL and the additional variation due to the noise injected for privacy purposes. Given Assumption (A4), we have ∥(W⊤W)−1∥=O(d/n) that appears in Σ1 and Σ2. As n increases, the dominant component of the second term would be ω2(W⊤W)−1(Id+Σ0)(W⊤W)−1.
5. Numerical Results
To evaluate the finite-sample performance of the proposed algorithms, we conduct a series of simulation studies and an application to a synthetic data set that contains real data.
5.1. Simulation Studies
In this section, we conduct simulation studies to assess the performance of the two proposed algorithms for simple linear regression with linked data. The nonprivate OLS estimator and RL estimator β^RL by Lahiri and Larsen (2005) are included as benchmarks. The private, non-RL counterpart methods are also performed in the absence of linkage errors for comparison.
Figure 3. Average ℓ2-error and variance (theoretical versus empirical), with (ϵ,δ)=(1,8.5×10−5), against n, σ, and linkage accuracy, respectively. The ‘RL-NGD' and ‘RL-SSP' algorithms are our proposed post-RL approaches applied to the linked data, compared with the non-RL ‘NGD' and ‘SSP' methods applied to (X,y) (i.e., with no linkage errors). The nonprivate ‘OLS' and ‘RL-OLS' (Lahiri & Larsen, 2005) results are also plotted for benchmarking. The number of iterations for ‘RL-NGD' results fall within the range of (210, 260).
For each simulation, a fixed design matrix X and a matching probability matrix Q are produced and a total of 1,000 repetitions are run over the randomness of both the intrinsic error e∼N(0,σ2In) of the regression model and the noise injected for privacy. Figure 3 displays the ℓ2 relative error and both empirical and theoretical variances for the two settings.
Two sets of simulations are conducted to explore the performance with varying sample size n and σ, the homoskedastic variance of the random error in linear model (1.1). The parameters are set as follows:
ELE linkage model: blockwise linkage accuracy γi characterizing Q, block size ni=25.
Settings 1 and 2: γi∈uniform(0.6,0.9), M=1 in Assumption (A2).
Setting 3: the linkage accuracy γi≡γ which varies from 0.6 to 1, while M scales from 1 to 0.
Setting 1: n varies from 3,000 to 10,000, σ is fixed at 1.
Setting 2: n is fixed at 10,000, σ varies from 0.5 to 1.8.
Setting 3: n is fixed at 10,000, σ is fixed at 1.
privacy budget: ϵ=1, δ=1/n1.1.
In setting 1, where σ is fixed at 1, Figure 3a shows that the errors of all methods decrease with a growing sample size. Due to the linkage errors, the post-RL methods, including β^RL and our two algorithms (denoted as ‘RL-OLS,’ ‘RL-NGD,’ and ‘RL-SSP’ in the figures) run on the linked data (X,z), naturally always yield larger errors than their counterparts run on (X,y) when no linkage has to be done beforehand. In this case, with σ=1, post-RL SSP outperforms post-RL NGD in terms of both accuracy and variance. However, as σ increases, post-RL NGD algorithm starts perform better, as depicted in Figure 3c with varying σ. The error grows linearly for post-RL NGD and quadratically for post-RL SSP, which aligns with the theoretical results on the error bounds presented in Section 4.2. Similar trends are observed for comparison of the non-RL NGD and SSP algorithms. In Figure 3e, where linkage error tends to zero, the post-RL versions of the three estimators approach the corresponding non-RL versions. NGD and SSP methods have strictly larger error than OLS due to the cost of privacy.
Figures 3b, 3c, and 3f illustrate the empirical variances (EMP) against the theoretical variances (THR) of the proxy estimators given in Section 4.3. The theoretical variance of post-RL NGD closely aligns with the empirical variance at the chosen level of projection C. Recall that the theoretical variance would be exact when no projection is applied. Thus, with a lower level of projection on the gradient update, we anticipate it to be conservative. On the other hand, the theoretical variance of post-RL SSP approximates well with moderately large n and small σ. However, in scenarios with small n and/or large σ, our theoretical variance may underestimate the reality due to the approximation’s reliance on a first-order Taylor expansion. Therefore, one can include higher order terms for better approximation. In setting 3, where n and σ are fixed, as the linkage error vanishes, the variance reduces as a result of the smaller DP noise needed.
5.2. Application to Synthetic Data
Due to privacy concerns, pairs of data sets containing personal information, which serve as quasi-identifiers, are typically not made public. We instead synthesize from a pair of generated quasi-identifiers data sets and real data for regression, as in Salvati et al. (2021). For quasi-identifiers, we take advantage of the data sets generated by the Freely Extensible Biomedical Record Linkage (Febrl), which are available in the module RecordLinkage by de Bruin (2022) in Python. The pair of data sets for linkage we use correspond to 5,000 individuals. The domain indicator (state) is used for blocking. The record linkage is performed based on the Jaro-Winkler score (Jaro, 1989) or exact comparison on six quasi-identifiers (names, date of birth, address, etc.). The maximum score is 6 for pairs that have exact alignment. A threshold of 4 is chosen to link the records. For those left unlinked, we assign random links to ensure one-to-one linkage. A unique identifier is available in the data set for verification purposes. The resulting linkage accuracy for the nine blocks are γ=(0.880,0.903,0.918,0.938,0.905,0.875,0.898,0.917,0.898), making the overall accuracy 92.5%. We adopt the ELE model for Q and estimate it using γ.
On the other hand, an anonymous data set for regression comes from the Survey on Household Income and Wealth (SHIW) from the Bank of Italy (2014). The net disposable income and consumption are the explanatory variable X and the response y, respectively. Since the SHIW data set is larger, consisting of 8,151 data points, we drop the outliers and randomly draw 5,000 records and synthesize them with the Febrl data set. Figure 4 depicts the setup of the synthesization process. Using the unique identifier from the Ferlb data set, the regression variables (X,y) are appended to the quasi-identifiers (ΦA,ΦB), resulting in two separate data sets: (ΦA,X) and (ΦB,y). Then, record linkage is performed by comparing ΦA and ΦB to output the linked data (X,z) and the matrix Q.
To apply the proposed DP algorithms to the synthesized data set, we set the (hyper)parameters as follows. The privacy budget is given by (ϵ,δ)=(1,8.5×10−5). The variance of the random error, σ2, is estimated by the mean squared error. The upper bounds in Assumptions (A1)-(A3) are set as: M=1, c0=1, cx=max(X)=2.78. In the NDG method, the projection level C is set to 1.2.
To illustrate the importance of propagating linkage uncertainty when conducting downstream regression, we also apply the non-RL version of NGD and SSP algorithms. We obtain the non-RL regression results by running post-RL NGD and post-RL SSP methods with M set to 0 and without converting X into W. This is equivalent to applying the non-RL methods discussed in Cai et al. (2021) and Sheffet (2017) to the linked set (X,z) as if it were perfectly linked.
Figure 5 displays the boxplots of the estimates of each algorithm. For each algorithm, a total of 1,000 repetitions are done in order to reflect the randomness of the injected noise for privacy purposes. The variables X and y are standardized before conducting simple linear regression. The OLS estimator on (X,y) (dashed line) is plotted for comparison. As can be seen, the DP estimators by running (non-RL) NGD and SSP on (X,z) directly are excessively biased as a consequence of ignoring linkage errors, even when the overall linkage accuracy is as high as 92.5%. Conversely, the results of post-RL NGD and post-RL SSP yields estimates centered around the OLS estimator but with higher variances, attributed to the cost of bias correction. Post-RL NGD is more flexible due to hyperparameter tuning. Additionally, we run the NGD methods for fewer iterations with T=⌈L2ln(c02n)/3⌉, which is one-third of the value recommended by theory. We have found that this approach yields smaller variance while still producing accurate results in finite samples. Therefore, the theoretical number of iterations T=⌈L2ln(c02n)⌉ may be conservative in some circumstances. Moderately reducing T may lead to better results.
6. Discussion
In this article, we propose two differentially private algorithms for linear regression on a linked data set that contains linkage errors, by leveraging the existing work on (1) linear regression after record linkage and (2) differentially private linear regression. Figure 6 displays the connections among the related areas at a high level, including PPRL and SMPC mentioned in Sections 1 and 2.3. Our work is the first one to simultaneously consider the linkage uncertainty propagation and the privatization of the output. It also complements the area of PPRL where the main concern is the data leakage among different parties. However, we do not discuss how to link the records in the first place and thus the security issues of the linkage process are beyond our scope. Instead, we treat record linkage from a secondary perspective: we begin with linked data prepared by an external entity and we have limited information about the linkage quality.
Specifically, we propose two post-RL algorithms based on the noisy gradient descent and sufficient statistics perturbation methods from the DP literature. We provide privacy guarantees and finite-sample error bounds for these algorithms and discuss the variances of the private estimators. Our simulation studies and the application demonstrate the following: (1) the proposed estimators converge as the sample size increases; (2) post-RL linear regression incurs a higher cost than the non-RL counterpart in terms of the privacy–accuracy tradeoff; (3) The NGD method is flexible with hyperparameter tuning and can be applied to more general optimization problems; and (4) SSP is specific to the least-squares problem, offering greater budget efficiency and more accurate results provided that the random error of the regression model is not too large.
There are different directions to extend our work. Note that there may be different scenarios of linking between the two data sets of the same set of entities. Assuming one-to-one linkage, as in our article, is a canonical scenario. Although we do not explore it, we expect that our methods can be extended to other scenarios (e.g., one-to-many linkage) where Q still makes sense. Extra assumptions may be required when determining the relevant sensitivities for privacy purposes.
One can also consider record linkage from a primary perspective. In addition to the traditional Fellegi–Sunter model, Bayesian approaches and machine learning–based methods have gained popularity. The record linkage may take forms other than the matching probability matrix adopted here. Furthermore, when privacy concerns arise during the linkage process involving different parties, PPRL and SMPC protocols become essential. Tackling all the challenges depicted in Figure 6 simultaneously with a single efficient tool is of great practical use and significance. This interdisciplinary challenge requires expertise in both statistics and computer science.
Another important direction is exploring related statistical problems in the post-RL context, with or without privacy constraints. For example, confidence intervals and hypothesis testing are fundamental statistical inference tools. Other potential problems that interest statisticians include high-dimensional linear regression and ridge regression.
Disclosure Statement
The research was supported in part by the U.S. Census Bureau Cooperative Agreement CB20ADR0160001 and Canadian NSERC grant RGPIN-2023-03566. The authors would like to thank Adam Smith (Boston University) for all the helpful discussions and comments.
References
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308–318). Association for Computing Machinery. https://doi.org/10.1145/2976749.2978318
Alabi, D., Smith, A., McMillan, A., Vadhan, S., & Sarathy, J. (2020). Differentially private simple linear regression. ArXiv. https://doi.org/10.48550/arXiv.2007.05157
Alabi, D. G. (2022). The algorithmic foundations of private computational social science [Unpublished doctoral dissertation]. Harvard University Graduate School of Arts and Sciences.
Bassily, R., Smith, A., & Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (pp. 464–473). https://doi.org/10.1109/FOCS.2014.56
Bernstein, G., & Sheldon, D. (2019). Differentially private Bayesian linear regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 525–535). Curran Associates Inc. https://papers.nips.cc/paper_files/paper/2019/hash/f90f2aca5c640289d0a29417bcb63a37-Abstract.html
Binette, O., & Steorts, R. C. (2022). (Almost) all of entity resolution. Science Advances, 8(12), Article eabi8021. https://doi.org/10.1126/sciadv.abi8021
Bun, M., & Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In M. Hirt & A. Smith (Eds.), Theory of cryptography (pp. 635–658). Springer.
Cai, T. T., Wang, Y., & Zhang, L. (2021). The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5), 2825–2850. https://doi.org/10.1214/21-AOS2058
Chambers, R., Fabrizi, E., Ranalli, M. G., Salvati, N., & Wang, S. (2023). Robust regression using probabilistically linked data. WIREs Computational Statistics, 15(2), Article e1596. https://doi.org/10.1002/wics.1596
Chipperfield, J. (2020). Bootstrap inference using estimating equations and data that are linked with complex probabilistic algorithms. Statistica Neerlandica, 74(2), 96–111. https://doi.org/10.1111/stan.12189
Chipperfield, J. O., & Chambers, R. L. (2015). Using the bootstrap to account for linkage errors when analysing probabilistically linked categorical data. Journal of Official Statistics, 31(3), 397–414. https://doi.org/10.1515/jos-2015-0024
Christen, P. (2012). Data matching: Concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer.
Christen, P., Ranbaduge, T., & Schnell, R. (2020). Linking sensitive data: Methods and techniques for practical privacy-preserving information sharing. Springer Cham. https://doi.org/10.1007/978-3-030-59706-1
Christophides, V., Efthymiou, V., Palpanas, T., Papadakis, G., & Stefanidis, K. (2020). An overview of end-to-end entity resolution for big data. ACM Computing Surveys, 53(6). https://doi.org/10.1145/3418896
Du, W., Han, Y. S., & Chen, S. (2004). Privacy-preserving multivariate statistical analysis: Linear regression and classification. In M. W. Berry, U. Dayal, C. Kamath, & D. B. Skillicorn (Eds.), Proceedings of the fourth SIAM international conference on data mining (pp. 222–233). SIAM. https://doi.org/10.1137/1.9781611972740.21
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. (2006). Our data, ourselves: Privacy via distributed noise generation. In S. Vaudenay (Ed.), Advances in Cryptology - EUROCRYPT 2006 (pp. 486–503). Springer Berlin Heidelberg. https://doi.org/10.1007/11761679_29
Dwork, C., & Nissim, K. (2004). Privacy-preserving datamining on vertically partitioned databases. In M. K. Franklin (Ed.), Advances in Cryptology - CRYPTO 2004, 24th Annual International Cryptology Conference (Vol. 3152, pp. 528–544). Springer. https://dx.doi.org/10.1007/978-3-540-28628-8_32
Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4), 211–407. http://dx.doi.org/10.1561/0400000042
Dwork, C., Smith, A. D., Steinke, T., & Ullman, J. (2017). Exposed! A survey of attacks on private data. Annual Review of Statistics and Its Application, 4, 61–84. https://api.semanticscholar.org/CorpusID:26766335
Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014). Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (pp. 11–20). Association for Computing Machinery. https://doi.org/10.1145/2591796.2591883
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210. https://doi.org/10.1080/01621459.1969.10501049
Foulds, J., Geumlek, J., Welling, M., & Chaudhuri, K. (2016). On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (pp. 192–201). AUAI Press.
Gascón, A., Schoppmann, P., Balle, B., Raykova, M., Doerner, J., Zahur, S., & Evans, D. (2017). Privacy-preserving distributed linear regression on high-dimensional data. Proceedings on Privacy Enhancing Technologies, 2017(4), 345–364. https://doi.org/10.1515/popets-2017-0053
Gkoulalas-Divanis, A., Vatsalan, D., Karapiperis, D., & Kantarcioglu, M. (2021). Modern privacy-preserving record linkage techniques: An overview. IEEE Transactions on Information Forensics and Security, 16, 4966–4987. https://doi.org/10.1109/TIFS.2021.3114026
Hall, R., & Fienberg, S. E. (2010). Privacy-preserving record linkage. In J. Domingo-Ferrer & E. Magkos (Eds.), Privacy in statistical databases (pp. 269–283). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15838-4_24
Hall, R., Fienberg, S. E., & Nardi, Y. (2011). Secure multiple linear regression based on homomorphic encryption. Journal of Official Statistics, 27, 669–691.
He, X., Machanavajjhala, A., Flynn, C., & Srivastava, D. (2017). Composing differential privacy and secure computation: A case study on scaling private record linkage. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (pp. 1389–1406). Association for Computing Machinery. https://doi.org/10.1145/3133956.3134030
Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420. https://doi.org/10.1080/01621459.1989.10478785
Kuzu, M., Kantarcioglu, M., Inan, A., Bertino, E., Durham, E., & Malin, B. (2013). Efficient privacy-aware record integration. In Proceedings of the 16th International Conference on Extending Database Technology (pp. 167–178). Association for Computing Machinery. https://doi.org/10.1145/2452376.2452398
Lahiri, P., & Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100(469), 222–230. https://doi.org/10.1198/016214504000001277
Lin, S., Bun, M., Gaboardi, M., Kolaczyk, E. D., & Smith, A. (2024). Differentially private confidence intervals for proportions under stratified random sampling. Electronic Journal of Statistics, 18(1), 1455–1494. https://doi.org/10.1214/24-EJS2234
Neter, J., Maynes, E. S., & Ramanathan, R. (1965). The effect of mismatching on the measurement of response errors. Journal of the American Statistical Association, 60(312), 1005–1027. https://doi.org/10.1080/01621459.1965.10480846
Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage of vital records. Science, 130(3381), 954–959. https://doi.org/10.1126/science.130.3381.954
Rao, F.-Y., Cao, J., Bertino, E., & Kantarcioglu, M. (2019). Hybrid private record linkage: Separating differentially private synopses from matching records. ACM Transactions on Privacy and Security, 22(3), Article 15. https://doi.org/10.1145/3318462
Salvati, N., Fabrizi, E., Ranalli, M. G., & Chambers, R. L. (2021). Small area estimation with linked data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(1), 78–107. https://doi.org/10.1111/rssb.12401
Sanil, A. P., Karr, A. F., Lin, X., & Reiter, J. P. (2004). Privacy preserving regression modelling via distributed computation. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 677–682). https://doi.org/10.1145/1014052.1014139
Sarathy, J., & Vadhan, S. (2024). Analyzing the differentially private Theil-Sen estimator for simple linear regression. ArXiv. https://doi.org/10.48550/arXiv.2207.13289
Scheuren, F., & Winkler, W. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19, 39–58.
Steorts, R. C., Ventura, S. L., Sadinle, M., & Fienberg, S. E. (2014). A comparison of blocking methods for record linkage. In J. Domingo-Ferrer (Ed.), Privacy in statistical databases (pp. 253–268). Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_20
Vu, D., & Slavkovic, A. (2009). Differential privacy for clinical trial data: Preliminary evaluations. In Y. Saygin, J. Xu Yu, H. Kargupta, W. Wang, S. Ranka, P. S. Yu, & X. Wu (Eds.), 2009 IEEE International Conference on Data Mining Workshops (pp. 138–143). IEEE. https://doi.org/10.1109/ICDMW.2009.52
Wang, Y.-X. (2018). Revisiting differentially private linear regression: Optimal and adaptive prediction & estimation in unbounded domain. In A. Globerson & R. Silva (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference (Article 93–103). AUAI Press. http://auai.org/uai2018/proceedings/papers/40.pdf
Wang, Z., Ben-David, E., Diao, G., & Slawski, M. (2022). Regression with linked datasets subject to linkage error. WIREs Computational Statistics, 14(4), Article e1570. https://doi.org/10.1002/wics.1570
Zhang, L.-C., & Tuoto, T. (2021). Linkage-data linear regression. Journal of the Royal Statistical Society: Series A (Royal Statistical Society), 184(2), 522–547. https://doi.org/10.1111/rssa.12630
Supplementary File
Complete proofs of all theorems can be found in the supplementary file.