There has been increasing demand for privacy-preserving methodologies in modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising framework offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream preprocessing typically not incorporated. An important example is record linkage performed prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets on the same group of entities in the absence of a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this article, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for estimating the regression coefficients. We investigate the privacy–accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.
Keywords: differential privacy, record linkage, data integration, privacy-preserving record linkage, gradient descent
Differential privacy is a mathematical framework for ensuring the privacy of individuals in data sets. It mitigates the risk of disclosing sensitive information about individuals during data analysis. Within this framework, we are interested in the relationship between two variables (via statistical regression) after they are linked, with uncertainty, from two data sources. The preprocessing procedure of linking data sets is called record linkage, and its uncertainty should be taken into account in the downstream analysis. In this article, we propose two algorithms that satisfy differential privacy for regression estimation with linked data. Theoretical results on privacy guarantees and statistical accuracy are provided. We demonstrate the performance of the proposed algorithms through simulations and an application.
Data on the same group of entities are often scattered across different sources, lacking unique identifiers for perfect linkage. To conduct statistical modeling or inference on the integrated information, it is necessary to probabilistically link multiple data sets by comparing common quasi-identifiers (e.g., names, gender, address) as a preprocessing step. Such a procedure is called record linkage (RL), also known as entity resolution or data matching (Christen, 2012), and is an essential component of data integration in big data analytics (Dong & Srivastava, 2015). Owing to its wide application in disciplines such as public health and official statistics, record linkage has been studied for decades. Early pioneering works include Newcombe et al. (1959), Fellegi and Sunter (1969), and Jaro (1989). Record linkage also remains common in current practice. The U.S. Census Bureau has a long tradition of using record linkage methodology for multiple endeavors; a prominent current example is the Decennial Census (U.S. Census Bureau, 2022), in which record linkage involves using administrative records and other data sources to improve data quality, with efforts underway to construct a comprehensive ‘reference database’ of individuals from multiple administrative records. A recent review paper, Binette and Steorts (2022), provides a comprehensive summary of record linkage. Broadly speaking, there are two perspectives on record linkage (Salvati et al., 2021): (1) the primary perspective concerns how to link the records; (2) the secondary perspective concerns how to propagate uncertainty to downstream statistical learning tasks after the linkage has been determined. This article adopts the second of these two perspectives.
Closely related to record linkage is data privacy. In the area of privacy-preserving record linkage (PPRL), two or more private data sets owned by different organizations are linked without revealing the data to one another (Christen et al., 2020; Hall & Fienberg, 2010). The outcome of PPRL is the information about which pairs or sets of records are matched. PPRL, in turn, is associated with secure multiparty computation (SMPC) in that SMPC techniques are commonly used to solve PPRL problems (He et al., 2017; Kuzu et al., 2013; Rao et al., 2019). PPRL to date engages only in the private linkage process from the primary perspective, without concern for how linkage uncertainty affects the downstream analysis. In contrast, the secondary perspective modifies the statistical tools to account for linkage uncertainty. Our goal is to incorporate privacy into the secondary perspective of record linkage, which is different from, yet complementary to, PPRL and SMPC.
Privacy concerns have only intensified with the emergence of individual-level big data. Releasing information about a sensitive data set is subject to a variety of privacy attacks (Dwork et al., 2017). There has therefore been growing demand for robust privacy-preserving methodologies in modern statistics and machine learning. Differential privacy (DP), a mathematical framework proposed by Dwork et al. (2006), is now considered the gold standard for rigorous privacy protection and has seen broad application in industry (Apple Differential Privacy Team, 2017; Bind, 2020; Google, 2021) and the public sector (U.S. Census Bureau, 2021). The literature on differential privacy has flourished in recent years, and the interface of differential privacy and statistics is drawing increasing attention from the statistics community.
Recent work on differential privacy focuses primarily on individual statistical and machine learning tasks, with nontrivial upstream preprocessing, such as record linkage, typically not incorporated. In this article, we consider the linear regression problem, that is,
but where
Specifically, we propose two algorithms for linear regression after record linkage to meet differential privacy: (1) post-RL noisy gradient descent (NGD), and (2) post-RL sufficient statistics perturbation (SSP). Our work builds on the seminal work by Lahiri and Larsen (2005) where an estimator is proposed for linear regression with linked data in a non-privacy-aware setting. We construct a private estimator,
The two proposed algorithms also extend the noisy gradient method (Bassily et al., 2014) and the “Analyze Gauss” algorithm (Dwork et al., 2014), as applied to linear regression, to additionally handle linkage errors. Prior works (Alabi et al., 2020; Bernstein & Sheldon, 2019; Cai et al., 2021; Sheffet, 2017; Y.-X. Wang, 2018) on differentially private linear regression do not consider possible record linkage preprocessing. If the data are linked beforehand, directly applying their algorithms to the imperfectly linked data is not ideal: it is well known that overlooking linkage errors leads to substantial bias even with high linkage accuracy (Neter et al., 1965; Scheuren & Winkler, 1993). Figure 2 showcases a toy example of record linkage in which mismatches, if treated as true, change the sign of the slope estimate. Our illustrative application later in the article confirms this: around 90% of the records are correctly linked, yet the estimators ignoring linkage errors end up with large biases.
Accompanying the estimators resulting from our algorithms, we provide mean-squared error bounds under typical regularity assumptions and record linkage schemes. When no linkage errors are present (i.e., a special case in our scenario), our result in Theorem 4.4 improves upon the noisy gradient method proposed in Cai et al. (2021) by using zero-concentrated differential privacy (Bun & Steinke, 2016) to enable tighter bounds on privacy cost (see Lemma 2.3). Additionally, we have presented (approximate) theoretical variances for
The remainder of this article is organized as follows. Section 2 provides preliminaries on linear regression with linked data and differential privacy. We propose our two algorithms in Section 3 and present the relevant theoretical results in Section 4. In Section 5, we conduct a series of simulation studies and an application to synthetic data. Section 6 concludes and discusses future work. Complete proofs of all theorems can be found in the supplementary materials.
In this section, we review the background on linear regression after record linkage upon which we build our work, as well as fundamental concepts from differential privacy. Related work on linear regression with linked data and on record linkage with privacy awareness is also discussed.
Let
then
For the fixed-design homoskedastic linear model (1.1), when inference is done after record linkage based on
where
In addition, the variance of
where
Note that
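As a concrete, nonprivate illustration of the estimator discussed above, the following sketch assumes the familiar Lahiri–Larsen form in which the design matrix is transformed by the matching probability matrix, i.e., solving the normal equations with W = QX (consistent with (2.2)); the data here are purely illustrative:

```python
import numpy as np

def lahiri_larsen(X, z, Q):
    """Lahiri-Larsen-type estimator: solve (W'W) beta = W'z with W = QX.

    X : (n, d) fixed design matrix
    z : (n,)   responses observed after (possibly erroneous) linkage
    Q : (n, n) matching probability matrix; row i holds the probabilities
               that z_i is the response of each record in X
    """
    W = Q @ X
    return np.linalg.solve(W.T @ W, W.T @ z)

# Sanity check: under perfect linkage (Q = I) this reduces to plain OLS.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)
beta_hat = lahiri_larsen(X, y, np.eye(50))
```

Under perfect linkage the estimator coincides with OLS; with an imperfect Q, the weighting by matching probabilities is what removes the attenuation bias caused by mismatches.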
The matching probability matrix (MPM)
Blocking Scheme. It is assumed that the MPM is a block diagonal matrix, which means the true matches only happen within blocks. Blocking significantly reduces the number of pairs for comparison and allows scalable record linkage. This scheme is used in almost all real-world applications, and different methods for blocking have been developed (Christen, 2012; Christophides et al., 2020; Steorts et al., 2014).
Exchangeable Linkage Errors (ELE) Model. The ELE model (Chambers, 2009) assumes homogeneous linkage accuracy and errors:
The ELE model has been adopted in recent works, such as Salvati et al. (2021) and Chambers et al. (2023), for various estimation problems. Even though (2.4) may oversimplify reality, it is a representative model for a secondary analyst who has minimal information about the linkage quality. When blocking is used, the homogeneous linkage accuracy assumption is imposed within individual blocks; in other words, it still allows heterogeneous linkage accuracy across blocks.
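The combination of blocking and the ELE model can be made concrete with a small sketch. The construction below is an assumption-laden illustration: within a block of size m with linkage accuracy λ, the diagonal of the block is λ and every off-diagonal entry is (1 − λ)/(m − 1), so each row sums to 1, and blocks with different accuracies are placed on the diagonal of the MPM:

```python
import numpy as np

def ele_block(m, lam):
    """ELE matching-probability block of size m: diagonal lam,
    off-diagonal (1 - lam) / (m - 1), so each row sums to 1."""
    Q = np.full((m, m), (1.0 - lam) / (m - 1))
    np.fill_diagonal(Q, lam)
    return Q

def blockwise_mpm(block_sizes, accuracies):
    """Block-diagonal MPM allowing heterogeneous accuracy across blocks."""
    n = sum(block_sizes)
    Q = np.zeros((n, n))
    i = 0
    for m, lam in zip(block_sizes, accuracies):
        Q[i:i + m, i:i + m] = ele_block(m, lam)
        i += m
    return Q

Q = blockwise_mpm([3, 4], [0.9, 0.8])   # two blocks, different accuracies
```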
Let
The expression (2.5) controls the distance between the output distributions on two neighboring data sets through the privacy budget
Differential privacy enjoys the following properties that facilitate the construction of differentially private algorithms.
Generally, a differentially private algorithm is constructed by adding random noise from a certain structured distribution, such as the Laplace or Gaussian distributions. A notion central to the amount of noise we add is the sensitivity of the estimation function we desire to release privately.
The sensitivity of a function characterizes how much the output would change if one record in the data set changes. To achieve
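As a minimal illustration of how sensitivity dictates the noise scale, the sketch below uses the classical Gaussian-mechanism calibration σ = Δ·sqrt(2 log(1.25/δ))/ε, which is valid for ε ≤ 1; tighter calibrations (e.g., via the zCDP analysis in Lemma 2.3) exist, and the example data are illustrative:

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-DP via the Gaussian mechanism.

    Uses the classical calibration sigma = Delta * sqrt(2 log(1.25/delta)) / eps,
    valid for eps <= 1; tighter calibrations (e.g., via zCDP) are available.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    value = np.asarray(value, dtype=float)
    return value + rng.normal(scale=sigma, size=value.shape)

# Example: a private mean of n values in [0, 1]. Changing one record moves
# the mean by at most 1/n, so the l2-sensitivity is 1/n.
x = np.linspace(0.0, 1.0, 1000)
priv_mean = gaussian_mechanism(x.mean(), l2_sensitivity=1 / len(x),
                               eps=1.0, delta=1e-5,
                               rng=np.random.default_rng(0))
```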
Combining the basic composition rule and the Gaussian mechanism, for a sequence of functions
where
Please refer to the supplementary materials for details. Since, in most practical budget settings, we have
In Section 3, we shall employ Lemmas 2.2 and 2.3 in devising two distinct algorithms for linear regression after record linkage.
Linear regression with linked data is a fundamental statistical task that has been explored in various articles. Scheuren and Winkler (1993) first considered the linkage model (2.1) for linear regression and proposed an estimator that is not generally unbiased. Later, Lahiri and Larsen (2005) introduced an exactly unbiased ordinary least squares (OLS)–like estimator, given in (2.2), with an expression for its variance, which outperformed the approach of Scheuren and Winkler (1993). In addition, Chambers (2009) and Zhang and Tuoto (2021) offered several other estimators. According to their simulation studies, some of these estimators performed at best comparably, but not noticeably better, relative to the one proposed by Lahiri and Larsen (2005). Zhang and Tuoto (2021), however, relaxed the conditions by not assuming that the probability of correct linkage,
On the other hand, there is ongoing research on privacy-preserving record linkage in the field of computer science. PPRL aims to privately link multiple sensitive data sets held by different organizations when they are unwilling or not permitted to share their data with external parties due to privacy and confidentiality concerns. To achieve privacy protection, techniques such as SMPC and DP are combined with machine learning and deep learning methods for conducting PPRL (Christen et al., 2020; Gkoulalas-Divanis et al., 2021; Ranbaduge et al., 2022). PPRL primarily concerns data leakage during the linkage process and produces a linked data set that can be used for further analysis, yet most applications treat the linked data as if there were no linkage errors. Neither the uncertainty propagation nor the private release of the downstream analysis is considered within the scope of PPRL.
Note that there are several articles on privacy-preserving analysis of vertically partitioned databases. In these databases, the attributes are distributed among multiple parties, but common unique identifiers exist to facilitate data linkage across parties. Unlike probabilistic record linkage, vertically partitioned databases do not involve linkage errors. Du et al. (2004), Gascón et al. (2017), Hall et al. (2011), and Sanil et al. (2004) discussed implementations of privacy-preserving linear regression protocols that prevent data disclosure across organizations, whereas Dwork and Nissim (2004) considered data mining from the perspective of the private release of statistical queries, in a spirit similar to our work.
The unbiased and simply structured estimator provided in (2.2) with a known closed-form variance makes it a suitable prototype to construct our private estimators. We introduce two differentially private algorithms in the following, based on (1) noisy gradient descent and (2) sufficient statistics perturbation. As the names suggest, we mitigate privacy risk by perturbing either the gradient or sufficient statistics during the computation of the linear model. Hereafter, if not specified otherwise,
Gradient descent methods are ubiquitous in scientific computing for numerous optimization problems. Within the framework of differential privacy, Bassily et al. (2014) provided a noisy variant of the classic gradient descent algorithm. It was later adapted by Cai et al. (2021) to solve the classic linear regression problem with faster convergence. Leveraging the work by Bassily et al. (2014) and Cai et al. (2021), we tailor the noisy gradient method for the post-RL linear regression model for
Let
Input: Linked data set
Algorithm 1 is a modified version of the projected gradient descent that incorporates (1) post-RL transformation of the design matrix, (2) addition of noise
The major challenge lies in calculating the sensitivity. In the non-RL least squares regression, two neighboring data sets
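The main loop of the noisy projected gradient method can be sketched as follows. This is only an illustration of the mechanics: the step size, projection radius C, truncation level R, and noise scale σ below are placeholders rather than the values calibrated from the sensitivity analysis and privacy budget in our theory:

```python
import numpy as np

def noisy_gd(X, z, Q, T=200, eta=0.25, C=10.0, R=50.0, sigma=1e-3, seed=0):
    """Post-RL noisy projected gradient descent (sketch).

    Minimizes the least-squares loss with the RL-transformed design
    W = QX. Each gradient is perturbed with Gaussian noise (scale sigma,
    a placeholder for the sensitivity-calibrated value), and the iterate
    is projected back onto the ball of radius C.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = Q @ X
    zt = np.clip(z, -R, R)            # truncate the response at level R
    beta = np.zeros(d)
    for _ in range(T):
        grad = (2.0 / n) * W.T @ (W @ beta - zt)
        beta = beta - eta * (grad + rng.normal(scale=sigma, size=d))
        nrm = np.linalg.norm(beta)
        if nrm > C:                   # projection step
            beta *= C / nrm
    return beta

# With perfect linkage and tiny noise, the iterate approaches the OLS fit.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)
beta_ngd = noisy_gd(X, y, np.eye(200))
```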
Noise can also be injected at points in the process other than the gradient computation. Since the estimator interacts with the data only through its (joint) sufficient statistics, an efficient alternative is to perturb the sufficient statistics to protect the data. This technique, sufficient statistics perturbation, has been used in previous works such as Foulds et al. (2016), Vu and Slavkovic (2009), and Y.-X. Wang (2018). For the nonprivate OLS estimator
In our work, we adapt the “Analyze Gauss” algorithm to linear regression after record linkage, as shown in Algorithm 2. The noise scale factor
Input: Linked data set
Remark 3.1. In step 4, by postprocessing, checking for singularity of
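A minimal sketch of the sufficient statistics perturbation idea follows. The noise scale σ is a placeholder for the sensitivity-calibrated value, and redrawing the noise when the perturbed Gram matrix is not positive definite is only one simple way to handle the singularity issue; a careful implementation would follow the treatment in Algorithm 2 and Remark 3.1:

```python
import numpy as np

def ssp_regression(X, z, Q, sigma=1e-2, max_tries=100, seed=0):
    """Post-RL sufficient statistics perturbation (sketch).

    Perturbs W'W (with a symmetrized Gaussian noise matrix) and W'z,
    where W = QX, then solves the noisy normal equations. sigma stands
    in for the scale dictated by the sensitivity and privacy budget.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = Q @ X
    A, b = W.T @ W, W.T @ z
    for _ in range(max_tries):
        E = rng.normal(scale=sigma, size=(d, d))
        E = (E + E.T) / 2.0                 # symmetric noise for W'W
        if np.all(np.linalg.eigvalsh(A + E) > 0):
            b_noisy = b + rng.normal(scale=sigma, size=d)
            return np.linalg.solve(A + E, b_noisy)
    raise RuntimeError("perturbed Gram matrix never became positive definite")

# With perfect linkage and tiny noise, the output is close to the OLS fit.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)
beta_ssp = ssp_regression(X, y, np.eye(100))
```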
An alternative approach to implementing the SSP method is to add random noise separately to each sufficient statistic. In this approach, the total privacy budget should be divided between
In this section, we provide the theoretical results of the two algorithms introduced in Section 3. The results are threefold: (1) differential privacy guarantees, (2) finite-sample error bounds, and (3) variances of the private estimators. We present each of these along with a discussion of the corresponding conditions as they relate to the main variables in our record linkage model. All proofs for these results can be found in the supplementary materials.
The algorithms are designed to achieve certain privacy guarantees, given the corresponding sensitivity, for the post-RL case:
Essentially, we assume that the data domain is bounded, which is critical for deriving a finite sensitivity of the target function on the data. (A1) is a standard assumption for a bounded design
The rationale for (A2) is supported by typical schemes imposed on the structure of the MPM in practice, as reviewed in Section 2.1.1. For example, with the blocking scheme, the size of each block is manageably small (O(1)). When one record is altered, the change to the MPM is limited to at most two blocks. Additionally, with the ELE model (2.4), as long as the changes to a single record only affect a finite number of records, the linkage accuracy
The proofs of Theorem 4.1 revolve around calculating the sensitivity of the target function in each algorithm. Besides the upper bounds
We study the accuracy of the proposed estimators by deriving the finite-sample error bounds. In the following, we introduce two more assumptions in addition to (A1) and (A2):
(A3) The true parameter
(A4) The minimum and maximum eigenvalues of
for some constant
Assumption (A4) implies the smoothness and strong convexity of the loss function
As a special case, when the linkage is perfect (i.e.,
For the two proposed algorithms, we present upper bounds of the excess squared error of the private estimators, namely,
step size
truncation level
noise scale factor
Remark 4.1. More specifically, to include the dependency on the assumed constants, we have the following high probability bound
where
truncation level
noise scale factor
Remark 4.2. More specifically, to include the dependency on the assumed constants, we have the following high probability bound
where
In both algorithms, the response is projected with a level
In the NGD method, the bound consists of two parts on the right-hand side in (4.5). The first error term
For the SSP algorithm, the convergence rate in (4.7) depends on similar variables as in the NGD algorithm. The major difference is that it is controlled by
Putting together Lemma 4.3 and Theorems 4.4 and 4.5, we obtain a high-probability error bound for each algorithm as follows.
Corollary 4.1. Under the regularity conditions (A1)-(A4),
(i) (Post-RL NGD)
with probability at least
(ii) (Post-RL SSP)
with probability at least
Although a few works (Alabi, 2022; Sarathy & Vadhan, 2024; Sheffet, 2017) have addressed the uncertainty of DP estimators through confidence bounds and intervals, and through the release of private variance estimates, the exact variance of DP estimators is rarely determined. Recent work, such as Lin et al. (2024), has explored the exact variance of private estimators of population proportions, which have fairly simple structures. The main barrier to inspecting the variance is that when noise is injected at intermediate steps of the estimation process rather than into the output, it is difficult to track the variability that the noise introduces to the output estimator, owing to the intricate nature of the algorithm.
The NGD and SSP algorithms are two examples where noise is added in the middle of the estimation process. Operations such as function composition and matrix inversion complicate the inspection of the variance of the output estimator
Remark 4.3. In the nonprivate case where
which is exactly the variance of
The estimator in Algorithm 1 is a projected variant of (4.11). The use of projection with level
Remark 4.4. As shown in the proof of Theorem 4.7 (see the supplementary materials), the proxy estimator
The variance of
To evaluate the finite-sample performance of the proposed algorithms, we conduct a series of simulation studies and an application to a synthetic data set that contains real data.
In this section, we conduct simulation studies to assess the performance of the two proposed algorithms for simple linear regression with linked data. The nonprivate OLS estimator and RL estimator
Figure 3. Average
For each simulation, a fixed design matrix
Two sets of simulations are conducted to explore the performance with varying sample size
ELE linkage model: blockwise linkage accuracy
Settings 1 and 2:
Setting 3: the linkage accuracy
regression model:
Setting 1:
Setting 2:
Setting 3:
privacy budget:
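A stripped-down, nonprivate version of such a simulation, under an assumed single-block ELE-style mislinkage (parameters illustrative, not the exact settings above), shows the bias from ignoring linkage errors and its correction:

```python
import numpy as np

rng = np.random.default_rng(42)
n, lam = 1000, 0.9                 # one block, linkage accuracy 0.9
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.2 * rng.normal(size=n)

# Crude ELE-style mislinkage: each response is kept with probability lam,
# otherwise replaced by the response of a uniformly chosen record.
z = y.copy()
for i in range(n):
    if rng.random() > lam:
        z[i] = y[rng.integers(n)]

# Naive OLS that ignores mislinkage: the slope is attenuated toward lam * 2.
naive = np.linalg.lstsq(X, z, rcond=None)[0]

# Lahiri-Larsen-style correction with the single-block ELE matrix Q.
Q = np.full((n, n), (1.0 - lam) / (n - 1))
np.fill_diagonal(Q, lam)
W = Q @ X
corrected = np.linalg.solve(W.T @ W, W.T @ z)
```

Even at 90% linkage accuracy, the naive slope is biased by roughly a factor of the linkage accuracy, while the corrected estimator stays close to the truth.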
In setting 1, where
Figures 3b, 3c, and 3f illustrate the empirical variances (EMP) against the theoretical variances (THR) of the proxy estimators given in Section 4.3. The theoretical variance of post-RL NGD closely aligns with the empirical variance at the chosen level of projection
Due to privacy concerns, pairs of data sets containing personal information, which serve as quasi-identifiers, are typically not made public. We instead build a synthetic data set from a pair of generated quasi-identifier data sets and real regression data, as in Salvati et al. (2021). For the quasi-identifiers, we use the data sets generated by the Freely Extensible Biomedical Record Linkage (Febrl) project, which are available in the Python module RecordLinkage by de Bruin (2022). The pair of data sets for linkage corresponds to 5,000 individuals. The domain indicator (state) is used for blocking. The record linkage is performed based on the Jaro-Winkler score (Jaro, 1989) or exact comparison on six quasi-identifiers (names, date of birth, address, etc.). The maximum score is 6 for pairs in exact alignment; a threshold of 4 is chosen to link the records. For those left unlinked, we assign random links to ensure one-to-one linkage. A unique identifier is available in the data set for verification purposes. The resulting linkage accuracies for the nine blocks are
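The score-and-threshold linkage step described above can be sketched as follows. The field names are illustrative rather than the actual Febrl schema, exact matching stands in for the Jaro-Winkler comparison, and the random assignment of leftover records is omitted:

```python
# Six quasi-identifier fields are compared pairwise, per-field scores are
# summed (max 6), and pairs scoring at least the threshold are greedily
# linked one-to-one, best-scoring pairs first.
FIELDS = ["given_name", "surname", "dob", "street", "suburb", "state"]

def pair_score(rec_a, rec_b):
    """One point per exactly matching nonempty field; maximum score is 6."""
    return sum(1.0 for f in FIELDS
               if rec_a.get(f) and rec_a.get(f) == rec_b.get(f))

def link(file_a, file_b, threshold=4.0):
    """Greedy one-to-one linkage: best-scoring available pairs first."""
    scored = sorted(((pair_score(a, b), i, j)
                     for i, a in enumerate(file_a)
                     for j, b in enumerate(file_b)), reverse=True)
    used_a, used_b, links = set(), set(), {}
    for s, i, j in scored:
        if s >= threshold and i not in used_a and j not in used_b:
            links[i] = j
            used_a.add(i)
            used_b.add(j)
    return links

rec = {"given_name": "ann", "surname": "lee", "dob": "1990-01-01",
       "street": "1 king st", "suburb": "carlton", "state": "vic"}
other = {"given_name": "rob", "surname": "wu", "dob": "1985-05-05",
         "street": "9 other st", "suburb": "hobart", "state": "wa"}
links = link([rec], [other, dict(rec)])   # rec matches its exact copy only
```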
On the other hand, an anonymous data set for regression comes from the Survey on Household Income and Wealth (SHIW) from the Bank of Italy (2014). The net disposable income and consumption are the explanatory variable
To apply the proposed DP algorithms to the synthesized data set, we set the (hyper)parameters as follows. The privacy budget is given by
To illustrate the importance of propagating linkage uncertainty to the downstream regression, we also apply non-RL versions of the NGD and SSP algorithms. We obtain the non-RL regression results by running the post-RL NGD and post-RL SSP methods with
Figure 5 displays the boxplots of the estimates of each algorithm. For each algorithm, a total of 1,000 repetitions are done in order to reflect the randomness of the injected noise for privacy purposes. The variables
In this article, we propose two differentially private algorithms for linear regression on a linked data set containing linkage errors, leveraging existing work on (1) linear regression after record linkage and (2) differentially private linear regression. Figure 6 displays the connections among the related areas at a high level, including PPRL and SMPC as discussed in Sections 1 and 2.3. Our work is the first to simultaneously consider linkage uncertainty propagation and privatization of the output. It also complements PPRL, whose main concern is data leakage among different parties. We do not, however, discuss how to link the records in the first place, so the security issues of the linkage process are beyond our scope. Instead, we treat record linkage from the secondary perspective: we begin with linked data prepared by an external entity, with limited information about the linkage quality.
Specifically, we propose two post-RL algorithms based on the noisy gradient descent and sufficient statistics perturbation methods from the DP literature. We provide privacy guarantees and finite-sample error bounds for these algorithms and discuss the variances of the private estimators. Our simulation studies and application demonstrate the following: (1) the proposed estimators converge as the sample size increases; (2) post-RL linear regression incurs a higher cost than its non-RL counterpart in terms of the privacy–accuracy tradeoff; (3) the NGD method is flexible with hyperparameter tuning and can be applied to more general optimization problems; and (4) SSP is specific to the least-squares problem, offering greater budget efficiency and more accurate results provided that the random error of the regression model is not too large.
There are several directions in which to extend our work. Two data sets on the same set of entities may be linked under different scenarios; assuming one-to-one linkage, as in this article, is the canonical one. Although we do not explore it, we expect that our methods can be extended to other scenarios (e.g., one-to-many linkage) where
One can also consider record linkage from a primary perspective. In addition to the traditional Fellegi–Sunter model, Bayesian approaches and machine learning–based methods have gained popularity. The record linkage may take forms other than the matching probability matrix adopted here. Furthermore, when privacy concerns arise during the linkage process involving different parties, PPRL and SMPC protocols become essential. Tackling all the challenges depicted in Figure 6 simultaneously with a single efficient tool is of great practical use and significance. This interdisciplinary challenge requires expertise in both statistics and computer science.
Another important direction is exploring related statistical problems in the post-RL context, with or without privacy constraints. For example, confidence intervals and hypothesis testing are fundamental statistical inference tools. Other potential problems that interest statisticians include high-dimensional linear regression and ridge regression.
The research was supported in part by the U.S. Census Bureau Cooperative Agreement CB20ADR0160001 and Canadian NSERC grant RGPIN-2023-03566. The authors would like to thank Adam Smith (Boston University) for all the helpful discussions and comments.
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308–318). Association for Computing Machinery. https://doi.org/10.1145/2976749.2978318
Alabi, D., Smith, A., McMillan, A., Vadhan, S., & Sarathy, J. (2020). Differentially private simple linear regression. ArXiv. https://doi.org/10.48550/arXiv.2007.05157
Alabi, D. G. (2022). The algorithmic foundations of private computational social science [Unpublished doctoral dissertation]. Harvard University Graduate School of Arts and Sciences.
Apple Differential Privacy Team. (2017, December 6). Learning with privacy at scale. Apple Machine Learning Research. https://machinelearning.apple.com/research/learning-with-privacy-at-scale
Bank of Italy. (2014). Survey on Household Income and Wealth - 2012. https://www.bancaditalia.it/statistiche/tematiche/indagini-famiglie-imprese/bilanci-famiglie/distribuzione-microdati/ricerca/ricerca.html?com.dotmarketing.htmlpage.language=1&min_anno_pubblicazione=2014&max_anno_pubblicazione=2014
Bassily, R., Smith, A., & Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (pp. 464–473). https://doi.org/10.1109/FOCS.2014.56
Bernstein, G., & Sheldon, D. (2019). Differentially private Bayesian linear regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 525–535). Curran Associates Inc. https://papers.nips.cc/paper_files/paper/2019/hash/f90f2aca5c640289d0a29417bcb63a37-Abstract.html
Bind, S. (2020, October 28). Putting differential privacy into practice to use data responsibly. Microsoft. https://blogs.microsoft.com/ai-for-business/differential-privacy/
Binette, O., & Steorts, R. C. (2022). (Almost) all of entity resolution. Science Advances, 8(12), Article eabi8021. https://doi.org/10.1126/sciadv.abi8021
Bun, M., & Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In M. Hirt & A. Smith (Eds.), Theory of cryptography (pp. 635–658). Springer.
Cai, T. T., Wang, Y., & Zhang, L. (2021). The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5), 2825–2850. https://doi.org/10.1214/21-AOS2058
Chambers, R. (2009). Regression analysis of probability-linked data. Statistics New Zealand. http://www.statisphere.govt.nz/further-resources-and-info/official-statistics-research/series/volume-4-2009.aspx#2
Chambers, R., Fabrizi, E., Ranalli, M. G., Salvati, N., & Wang, S. (2023). Robust regression using probabilistically linked data. WIREs Computational Statistics, 15(2), Article e1596. https://doi.org/10.1002/wics.1596
Chipperfield, J. (2020). Bootstrap inference using estimating equations and data that are linked with complex probabilistic algorithms. Statistica Neerlandica, 74(2), 96–111. https://doi.org/10.1111/stan.12189
Chipperfield, J. O., & Chambers, R. L. (2015). Using the bootstrap to account for linkage errors when analysing probabilistically linked categorical data. Journal of Official Statistics, 31(3), 397–414. https://doi.org/10.1515/jos-2015-0024
Christen, P. (2012). Data matching: Concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer.
Christen, P., Ranbaduge, T., & Schnell, R. (2020). Linking sensitive data: Methods and techniques for practical privacy-preserving information sharing. Springer Cham. https://doi.org/10.1007/978-3-030-59706-1
Christophides, V., Efthymiou, V., Palpanas, T., Papadakis, G., & Stefanidis, K. (2020). An overview of end-to-end entity resolution for big data. ACM Computing Surveys, 53(6). https://doi.org/10.1145/3418896
de Bruin, J. (2022). Record linkage toolkit documentation. Read the Docs. https://recordlinkage.readthedocs.io/en/latest/
Dong, X. L., & Srivastava, D. (2015). Record linkage. In Big data integration (pp. 63–106). Springer International. https://doi.org/10.1007/978-3-031-01853-4_3
Du, W., Han, Y. S., & Chen, S. (2004). Privacy-preserving multivariate statistical analysis: Linear regression and classification. In M. W. Berry, U. Dayal, C. Kamath, & D. B. Skillicorn (Eds.), Proceedings of the fourth SIAM international conference on data mining (pp. 222–233). SIAM. https://doi.org/10.1137/1.9781611972740.21
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. (2006). Our data, ourselves: Privacy via distributed noise generation. In S. Vaudenay (Ed.), Advances in Cryptology - EUROCRYPT 2006 (pp. 486–503). Springer Berlin Heidelberg. https://doi.org/10.1007/11761679_29
Dwork, C., & Nissim, K. (2004). Privacy-preserving datamining on vertically partitioned databases. In M. K. Franklin (Ed.), Advances in Cryptology - CRYPTO 2004, 24th Annual International Cryptology Conference (Vol. 3152, pp. 528–544). Springer. https://dx.doi.org/10.1007/978-3-540-28628-8_32
Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4), 211–407. http://dx.doi.org/10.1561/0400000042
Dwork, C., Smith, A. D., Steinke, T., & Ullman, J. (2017). Exposed! A survey of attacks on private data. Annual Review of Statistics and Its Application, 4, 61–84. https://api.semanticscholar.org/CorpusID:26766335
Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014). Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (pp. 11–20). Association for Computing Machinery. https://doi.org/10.1145/2591796.2591883
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210. https://doi.org/10.1080/01621459.1969.10501049
Foulds, J., Geumlek, J., Welling, M., & Chaudhuri, K. (2016). On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (pp. 192–201). AUAI Press.
Gascón, A., Schoppmann, P., Balle, B., Raykova, M., Doerner, J., Zahur, S., & Evans, D. (2017). Privacy-preserving distributed linear regression on high-dimensional data. Proceedings on Privacy Enhancing Technologies, 2017(4), 345–364. https://doi.org/10.1515/popets-2017-0053
Gkoulalas-Divanis, A., Vatsalan, D., Karapiperis, D., & Kantarcioglu, M. (2021). Modern privacy-preserving record linkage techniques: An overview. IEEE Transactions on Information Forensics and Security, 16, 4966–4987. https://doi.org/10.1109/TIFS.2021.3114026
Google. (2021, January 28). How we’re helping developers with differential privacy. https://developers.googleblog.com/en/how-were-helping-developers-with-differential-privacy/
Hall, R., & Fienberg, S. E. (2010). Privacy-preserving record linkage. In J. Domingo-Ferrer & E. Magkos (Eds.), Privacy in statistical databases (pp. 269–283). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15838-4_24
Hall, R., Fienberg, S. E., & Nardi, Y. (2011). Secure multiple linear regression based on homomorphic encryption. Journal of Official Statistics, 27, 669–691.
He, X., Machanavajjhala, A., Flynn, C., & Srivastava, D. (2017). Composing differential privacy and secure computation: A case study on scaling private record linkage. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (pp. 1389–1406). Association for Computing Machinery. https://doi.org/10.1145/3133956.3134030
Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420. https://doi.org/10.1080/01621459.1989.10478785
Kuzu, M., Kantarcioglu, M., Inan, A., Bertino, E., Durham, E., & Malin, B. (2013). Efficient privacy-aware record integration. In Proceedings of the 16th International Conference on Extending Database Technology (pp. 167–178). Association for Computing Machinery. https://doi.org/10.1145/2452376.2452398
Lahiri, P., & Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100(469), 222–230. https://doi.org/10.1198/016214504000001277
Lin, S., Bun, M., Gaboardi, M., Kolaczyk, E. D., & Smith, A. (2024). Differentially private confidence intervals for proportions under stratified random sampling. Electronic Journal of Statistics, 18(1), 1455–1494. https://doi.org/10.1214/24-EJS2234
Neter, J., Maynes, E. S., & Ramanathan, R. (1965). The effect of mismatching on the measurement of response errors. Journal of the American Statistical Association, 60(312), 1005–1027. https://doi.org/10.1080/01621459.1965.10480846
Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage of vital records. Science, 130(3381), 954–959. https://doi.org/10.1126/science.130.3381.954
Ranbaduge, T., Vatsalan, D., & Ding, M. (2022). Privacy-preserving deep learning based record linkage. arXiv. https://doi.org/10.48550/arXiv.2211.02161
Rao, F.-Y., Cao, J., Bertino, E., & Kantarcioglu, M. (2019). Hybrid private record linkage: Separating differentially private synopses from matching records. ACM Transactions on Privacy and Security, 22(3), Article 15. https://doi.org/10.1145/3318462
Salvati, N., Fabrizi, E., Ranalli, M. G., & Chambers, R. L. (2021). Small area estimation with linked data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(1), 78–107. https://doi.org/10.1111/rssb.12401
Sanil, A. P., Karr, A. F., Lin, X., & Reiter, J. P. (2004). Privacy preserving regression modelling via distributed computation. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 677–682). https://doi.org/10.1145/1014052.1014139
Sarathy, J., & Vadhan, S. (2024). Analyzing the differentially private Theil-Sen estimator for simple linear regression. arXiv. https://doi.org/10.48550/arXiv.2207.13289
Scheuren, F., & Winkler, W. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19, 39–58.
Sheffet, O. (2017). Differentially private ordinary least squares. Proceedings of Machine Learning Research, 70, 3105–3114. https://proceedings.mlr.press/v70/sheffet17a.html
Steorts, R. C., Ventura, S. L., Sadinle, M., & Fienberg, S. E. (2014). A comparison of blocking methods for record linkage. In J. Domingo-Ferrer (Ed.), Privacy in statistical databases (pp. 253–268). Springer Cham. https://doi.org/10.1007/978-3-319-11257-2_20
U.S. Census Bureau. (2021). Disclosure avoidance for the 2020 Census: An introduction. U.S. Department of Commerce. https://www2.census.gov/library/publications/decennial/2020/2020-census-disclosure-avoidance-handbook.pdf
U.S. Census Bureau. (2022). Annual report of the center for statistical research and methodology. U.S. Department of Commerce. https://www.census.gov/content/dam/Census/library/publications/2022/adrm/2022-CSRM-Annual-Report.pdf
Vu, D., & Slavkovic, A. (2009). Differential privacy for clinical trial data: Preliminary evaluations. In Y. Saygin, J. Xu Yu, H. Kargupta, W. Wang, S. Ranka, P. S. Yu, & X. Wu (Eds.), 2009 IEEE International Conference on Data Mining Workshops (pp. 138–143). IEEE. https://doi.org/10.1109/ICDMW.2009.52
Wang, Y.-X. (2018). Revisiting differentially private linear regression: Optimal and adaptive prediction & estimation in unbounded domain. In A. Globerson & R. Silva (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference (pp. 93–103). AUAI Press. http://auai.org/uai2018/proceedings/papers/40.pdf
Wang, Z., Ben-David, E., Diao, G., & Slawski, M. (2022). Regression with linked datasets subject to linkage error. WIREs Computational Statistics, 14(4), Article e1570. https://doi.org/10.1002/wics.1570
Zhang, L.-C., & Tuoto, T. (2021). Linkage-data linear regression. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(2), 522–547. https://doi.org/10.1111/rssa.12630
Complete proofs of all theorems can be found in the supplementary file.
©2024 Shurong Lin, Elliot Paquette, and Eric D. Kolaczyk. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.