
Deep Learning With Gaussian Differential Privacy

Published on Sep 30, 2020

Abstract

Deep learning models are often trained on data sets that contain sensitive information such as individuals’ shopping transactions, personal contacts, and medical records. An increasingly important line of work therefore has sought to train neural networks subject to privacy constraints that are specified by differential privacy or its divergence-based relaxations. These privacy definitions, however, have weaknesses in handling certain important primitives (composition and subsampling), thereby giving loose or complicated privacy analyses of training neural networks. In this article, we consider a recently proposed privacy definition termed $f$-differential privacy (Dong, Roth, et al., 2019) for a refined privacy analysis of training neural networks. Leveraging the appealing properties of $f$-differential privacy in handling composition and subsampling, this article derives analytically tractable expressions for the privacy guarantees of both stochastic gradient descent and Adam used in training deep neural networks, without the need to develop sophisticated techniques as Abadi et al. (2016) did. Our results demonstrate that the $f$-differential privacy framework allows for a new privacy analysis that improves on the prior analysis (Abadi et al., 2016), which in turn suggests tuning certain parameters of neural networks for a better prediction accuracy without violating the privacy budget. These theoretically derived improvements are confirmed by our experiments in a range of tasks in image classification, text classification, and recommender systems. Python code to calculate the privacy cost for these experiments is publicly available in the TensorFlow Privacy library.

Keywords: Gaussian differential privacy, deep learning, noisy gradient descent, central limit theorem, privacy accounting

Media Summary

In the era of big data, it is crucial and even urgent to develop algorithms that preserve the privacy of sensitive individual data while maintaining high utility. However, these two goals are generally competing with each other in the sense that privacy guarantees inevitably lead to utility loss. What is the optimal tradeoff between privacy and utility for a given machine learning task? This is a question that any data scientist caring about data privacy would ask.

In this article, the authors offer deep learning models with privacy guarantees in the framework of $f$-differential privacy, which was recently proposed as a unifying generalization of several earlier privacy definitions. The authors train their private neural networks on multiple tasks that involve different types of data, including images, texts, and numeric data. Owing to the versatility of the $f$-differential privacy framework, the experimental results demonstrate that this approach to private deep learning outperforms existing approaches in terms of the tradeoff between privacy and utility.


1. Introduction

In many applications of machine learning, the data sets contain sensitive information about individuals such as location, personal contacts, media consumption, and medical records. Exploiting the output of the machine learning algorithm, an adversary may be able to identify some individuals in the data set, thus presenting serious privacy concerns. This reality gave rise to a broad and pressing call for developing privacy-preserving data analysis methodologies. Accordingly, there have been numerous investigations in the scholarly literature of many fields—statistics, cryptography, machine learning, and law—for the protection of privacy in data analysis.

Along this line, research efforts have repeatedly suggested the necessity of a rigorous and versatile definition of privacy. Among other things, researchers have questioned whether the use of a privacy definition gives interpretable privacy guarantees, and if so, whether this privacy definition allows for high accuracy of the private model among alternative definitions. In particular, anonymization as a syntactic and ad hoc privacy concept has been shown to generally fail to guarantee privacy. Examples include the identification of a homophobic individual in the anonymized Netflix Challenge data set (Narayanan & Shmatikov, 2008) and the identification of the health records of the then-Massachusetts governor in public anonymized medical data sets (Sweeney, 1997).

In this context, $(\varepsilon, \delta)$-differential privacy (DP) arose as a mathematically rigorous definition of privacy, introduced by Dwork et al. (2006). Today, this definition has developed into a firm foundation of private data analysis, with its applications deployed by Google (Erlingsson et al., 2014), Apple (2017), Microsoft (Ding et al., 2017), and the U.S. Census Bureau (Abowd, 2018). Despite its impressive popularity both in the scholarly literature and in industry, $(\varepsilon, \delta)$-DP is not versatile enough to handle composition, which is perhaps the most fundamental primitive in statistical privacy. For example, the training process of deep neural networks is in effect the composition of many primitive building blocks known as stochastic gradient descent (SGD) updates. Under a modest privacy budget in the $(\varepsilon, \delta)$-DP sense, however, it was not clear how to maintain a high prediction accuracy of deep learning. This requires a tight privacy analysis of composition in the $(\varepsilon, \delta)$-DP framework. Indeed, the analysis of the privacy costs in deep learning was refined only recently using a sophisticated technique called the moments accountant by Abadi et al. (2016).

Ideally, we hope to have a privacy definition that allows for refined privacy analyses of various algorithms in a principled manner, without resorting to sophisticated techniques. Having a refined privacy analysis not only enhances the trustworthiness of the models but can also be leveraged to improve the prediction accuracy by trading off privacy for utility. One possible candidate is $f$-differential privacy, a relaxation of $(\varepsilon, \delta)$-DP that was recently proposed by Dong, Roth, and Su (2019). This new privacy definition faithfully retains the hypothesis testing interpretation of differential privacy and can losslessly reason about common primitives associated with differential privacy, including composition, privacy amplification by subsampling, and group privacy. In addition, $f$-DP includes a canonical single-parameter family that is referred to as Gaussian differential privacy (GDP). Notably, GDP is the focal privacy definition due to a central limit theorem (CLT) that states that the privacy guarantees of the composition of private algorithms are approximately equivalent to distinguishing two shifted normal distributions.

The main results of this article show that $f$-DP offers a rigorous and versatile framework for developing private deep learning methodologies. Our guarantee provides protection against an attacker with knowledge of the network architecture as well as the model parameters, which is in the same spirit as Shokri and Shmatikov (2015) and Abadi et al. (2016). In short, this article delivers the following messages concerning $f$-DP:

  1. Closed-form privacy bounds. In the $f$-DP framework, the overall privacy loss incurred in training neural networks admits an amenable closed-form expression. In contrast, the privacy analysis via the moments accountant must be done by numerical computation (see Abadi et al., 2016), and the implicit nature of this earlier approach can hinder our understanding of how the tuning parameters affect the privacy bound. This is discussed in Section 3.1.

  2. Stronger privacy guarantees. The $f$-DP approach gives stronger privacy guarantees than the earlier approach by Abadi et al. (2016), even in terms of $(\varepsilon, \delta)$-DP. This improvement is due to the use of the central limit theorem for $f$-DP, which accurately captures the privacy loss incurred at each iteration in training the deep learning models. This is presented in Section 3.2 and illustrated with numerical experiments in Section 4.1.

  3. Improved prediction accuracy. Leveraging the stronger privacy guarantees provided by $f$-DP, we can trade a certain amount of privacy for an improvement in prediction performance. This can be realized, for example, by appropriately reducing the amount of noise added during the training process of neural networks so as to match the target privacy level in terms of $(\varepsilon, \delta)$-DP. See Section 3.2 and Section 4.2 for the development of this utility improvement.

The remainder of the article is structured as follows. In Section 1.1 we provide a brief review of related literature. Section 2 introduces $f$-DP and its basic properties at a minimal level. Next, in Section 3 we analyze the privacy cost of training deep neural networks in terms of $f$-DP and compare it to the privacy analysis using the moments accountant. In Section 4, we present numerical experiments to showcase the superiority of the $f$-DP approach to private deep learning in terms of test accuracy and privacy guarantees. The article concludes with a discussion in Section 5.

1.1. Related Work

There are continued efforts to understand how privacy degrades under composition. Developments along this line include the basic composition theorem and the advanced composition theorem by Dwork, Kenthapadi, McSherry, Mironov, and Naor (2006) and Dwork, Rothblum, and Vadhan (2010). In a pioneering work, Kairouz, Oh, and Viswanath (2017) obtained an optimal composition theorem for $(\varepsilon, \delta)$-DP, which in fact served as one of the motivations for the $f$-DP work by Dong, Roth, et al. (2019). However, it is #P-hard to compute the privacy bounds from their composition theorem (see Murtagh & Vadhan, 2016). More recently, Dong, Durfee, and Rogers (2019) derived sharp composition bounds on the overall privacy loss for exponential mechanisms.

From a different angle, a substantial recent effort has been devoted to relaxing differential privacy using divergences of probability distributions to overcome the weakness of $(\varepsilon, \delta)$-DP in handling composition (see Bun et al., 2018; Bun & Steinke, 2016; Dwork & Rothblum, 2016; and Mironov, 2017). Unfortunately, these relaxations either lack a privacy amplification by subsampling argument or admit only a rather complex one that is difficult to use, as in Balle et al. (2018) and Wang et al. (2019). As subsampling is inherently used in training neural networks, it is therefore difficult to directly apply these relaxations to the privacy analysis of deep learning.

To circumvent these technical difficulties associated with $(\varepsilon, \delta)$-DP and its divergence-based relaxations, Abadi et al. (2016) invented a technique termed “the moments accountant” to track detailed information about the privacy loss in the training process of deep neural networks. Using the moments accountant, their analysis significantly improves on earlier privacy analyses of SGD (e.g., by Bassily et al., 2014; Shokri & Shmatikov, 2015; Song et al., 2013; and Zhang et al., 2017) and allows for meaningful privacy guarantees for deep learning trained on realistically sized data sets. This technique has been extended to a variety of situations by follow-up work from McMahan et al. (2018) and Pichapati et al. (2019). In contrast, our approach to private deep learning in the $f$-DP framework leverages some powerful tools of this new privacy definition and nevertheless provides a sharper privacy analysis, as seen both theoretically and empirically in Sections 3 and 4.

For completeness, we remark that different approaches have been put forward to incorporate privacy considerations into deep learning, without leveraging the iterative and subsampling natures of training deep learning models. This line of work includes training a private model by an ensemble of “teacher” models by Papernot et al. (2018) and Papernot et al. (2016), the development of noised federated averaging algorithms by McMahan et al. (2017), and analyzing privacy costs through the lens of the optimization landscape of neural networks by Xiang et al. (2019).

2. Preliminaries

2.1. $f$-Differential Privacy

In the differential privacy framework, we envision an adversary that is well-informed about the data set except for a single individual, and the adversary seeks to determine whether this individual is in the data set on the basis of the output of an algorithm. Roughly speaking, the algorithm is considered private if the adversary finds it hard to determine the presence or absence of any individual.

Informally, a data set can be thought of as a matrix, whose rows each contain one individual’s data. Two data sets are said to be neighbors if one can be derived by discarding an individual from the other. As such, the sizes of neighboring data sets differ by one. Let $S$ and $S'$ be neighboring data sets, let $\varepsilon \geqslant 0$ and $0 \leqslant \delta \leqslant 1$ be two numbers, and denote by $M$ a (randomized) algorithm that takes as input a data set.

Definition 2.1 (Dwork, Kenthapadi, et al., 2006; Dwork, McSherry, et al., 2006). A (randomized) algorithm $M$ gives $(\varepsilon,\delta)$-differential privacy if for any pair of neighboring data sets $S, S'$ and any event $E$,

$$\mathbb{P}(M(S)\in E)\leqslant \mathrm{e}^\varepsilon\,\mathbb{P}(M(S')\in E) +\delta.$$

To achieve privacy, the algorithm $M$ is necessarily randomized, whereas the two data sets in Definition 2.1 are deterministic. This privacy definition ensures that, based on the output of the algorithm, the adversary has a limited (depending on how small $\varepsilon, \delta$ are) ability to identify the presence or absence of any individual, regardless of whether any individual opts into or opts out of the data set.

In essence, the adversary seeks to distinguish the two probability distributions $M(S)$ and $M(S')$ using a single draw. In light of this observation, it is natural to interpret what the adversary does as testing two simple hypotheses:

$$H_0: \text{the true data set is } S \quad\text{ versus }\quad H_1: \text{the true data set is } S'.$$

The connection between differential privacy and hypothesis testing was, to our knowledge, first noted in Wasserman & Zhou (2010), and was later developed in Kairouz et al. (2017), Liu et al. (2019), and Balle et al. (2019). Intuitively, privacy is well guaranteed if the hypothesis testing problem is hard. Following this intuition, the definition of $(\varepsilon, \delta)$-DP essentially uses the worst-case likelihood ratio of the distributions $M(S)$ and $M(S')$ to measure the hardness of testing the two simple hypotheses.

Is there a more informative measure of the hardness? In Dong, Roth, et al. (2019), the authors propose to use the trade-off between type I error (the probability of erroneously rejecting $H_0$ when $H_0$ is true) and type II error (the probability of erroneously accepting $H_0$ when $H_1$ is true) in place of a few privacy parameters in $(\varepsilon, \delta)$-DP or divergence-based DP definitions. To formally define this new privacy definition, let $P$ and $Q$ denote the distributions of $M(S)$ and $M(S')$, respectively, and let $\phi$ be any (possibly randomized) rejection rule for testing $H_0: P$ against $H_1: Q$. With these in place, Dong, Roth, et al. (2019) define the trade-off function of $P$ and $Q$ as

$$T(P,Q): [0,1] \to [0,1], \qquad \alpha \mapsto \inf_{\phi}\big\{1 - \mathbb{E}_Q[\phi]: \mathbb{E}_P[\phi]\leqslant \alpha\big\}.$$

Above, $\mathbb{E}_P[\phi]$ and $1-\mathbb{E}_Q[\phi]$ are the type I and type II errors of the rejection rule $\phi$, respectively. Writing $f = T(P,Q)$, the definition says that $f(\alpha)$ is the minimum type II error among all tests at significance level $\alpha$. Note that the minimum can be achieved by taking the likelihood ratio test, according to the Neyman–Pearson lemma. As is self-evident, the larger the trade-off function is, the more difficult the hypothesis testing problem is (hence more privacy). This motivates the following privacy definition.

Definition 2.2 (Dong, Roth, et al., 2019). A (randomized) algorithm $M$ is $f$-differentially private if

$$T\big(M(S), M(S')\big) \geqslant f$$

for all neighboring data sets $S$ and $S'$.

In this definition, both $T$ and $f$ are functions that take $\alpha\in[0,1]$ as input, and the inequality holds pointwise for all $0 \leqslant \alpha \leqslant 1$; we abuse notation by identifying $M(S)$ and $M(S')$ with their associated distributions. This privacy definition is easily interpretable due to its inherent connection with the hypothesis-testing problem. By adapting a result due to Wasserman & Zhou (2010), $(\varepsilon, \delta)$-DP is a special instance of $f$-DP in the sense that an algorithm is $(\varepsilon, \delta)$-DP if and only if it is $f_{\varepsilon, \delta}$-DP with

$$(2.1) \qquad f_{\varepsilon, \delta}(\alpha) = \max\left\{ 0,\ 1 - \delta - \mathrm{e}^\varepsilon\alpha,\ \mathrm{e}^{-\varepsilon}(1-\delta-\alpha) \right\}.$$
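To make the correspondence in (2.1) concrete, here is a minimal sketch (ours, not code from the paper) that evaluates $f_{\varepsilon, \delta}$ at a few significance levels; the chosen values of $\varepsilon$ and $\delta$ are purely illustrative.

```python
# A minimal sketch evaluating the trade-off function (2.1) of (epsilon, delta)-DP.
# The chosen epsilon and delta are illustrative.
import numpy as np

def f_eps_delta(alpha, eps, delta):
    """Minimum type II error achievable at type I error alpha under (eps, delta)-DP."""
    alpha = np.asarray(alpha, dtype=float)
    return np.maximum.reduce([np.zeros_like(alpha),
                              1 - delta - np.exp(eps) * alpha,
                              np.exp(-eps) * (1 - delta - alpha)])

print(f_eps_delta([0.01, 0.05, 0.2], eps=1.0, delta=1e-5))
```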

The more intimate relationship between the two privacy definitions is that they are dual to each other: briefly speaking, $f$-DP ensures $(\varepsilon, \delta(\varepsilon))$-DP with $\delta(\varepsilon) = 1 + f^*(-\mathrm{e}^{\varepsilon})$ for every $\varepsilon\geqslant 0$, where $f^*$ denotes the convex conjugate of $f$.

Next, we define a single-parameter family of privacy definitions within the $f$-DP class for a reason that will be apparent later. Let $G_\mu := T\big(\mathcal{N}(0,1), \mathcal{N}(\mu,1)\big)$ for $\mu \geqslant 0$. Note that this trade-off function admits a closed-form expression $G_\mu(\alpha)=\Phi(\Phi^{-1}(1-\alpha)-\mu)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution.

Definition 2.3 (Dong, Roth, et al., 2019). A (randomized) algorithm $M$ is $\mu$-Gaussian differentially private (GDP) if

$$T\big(M(S), M(S')\big) \geqslant G_\mu$$

for all neighboring data sets $S$ and $S'$.

In words, $\mu$-GDP says that determining whether any individual is in the data set is at least as difficult as distinguishing the two normal distributions $\mathcal{N}(0, 1)$ and $\mathcal{N}(\mu, 1)$ based on one draw. The Gaussian mechanism serves as a template to achieve GDP. Consider the problem of privately releasing a univariate statistic $\theta(S)$. The Gaussian mechanism adds $\mathcal{N}(0, \sigma^2)$ noise to the statistic $\theta$, which gives $\mu$-GDP if $\sigma = \mathrm{sens}(\theta)/\mu$. Here the sensitivity of $\theta$ is defined as $\mathrm{sens}(\theta) = \sup_{S, S'} | \theta(S) - \theta(S') |$, where the supremum is over all neighboring data sets.
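The following minimal sketch (our illustration, not the paper's code) evaluates the closed-form trade-off function $G_\mu$ and applies the Gaussian mechanism described above; the bounded-sum statistic at the end is a hypothetical example with sensitivity 1.

```python
# A minimal sketch: the Gaussian trade-off function G_mu and the Gaussian
# mechanism calibrated to achieve mu-GDP for a statistic of known sensitivity.
import numpy as np
from scipy.stats import norm

def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu): type II error at level alpha."""
    return norm.cdf(norm.ppf(1 - alpha) - mu)

def gaussian_mechanism(value, sensitivity, mu, rng=None):
    """Release value + N(0, sigma^2) with sigma = sensitivity / mu, which is mu-GDP."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.normal(scale=sensitivity / mu)

# Under 1-GDP, a test with 5% type I error has type II error of roughly 74%.
print(gaussian_tradeoff(0.05, mu=1.0))

# Hypothetical example: privately release the sum of n values in [0, 1];
# removing one individual changes the sum by at most 1, so the sensitivity is 1.
data = np.random.default_rng(0).random(1000)
print(gaussian_mechanism(data.sum(), sensitivity=1.0, mu=0.5))
```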

2.2. Properties of $f$-Differential Privacy

Deep learning models are trained using the composition of many SGD updates. Broadly speaking, composition is concerned with a sequence of analyses on the same data set, where each analysis is informed by the explorations of prior analyses. A central issue that every privacy definition is faced with is to pinpoint how the overall privacy guarantee degrades under composition. Formally, letting $M_1$ be the first algorithm and $M_2$ be the second, we define their composition algorithm $M$ as $M(S) = (M_1(S), M_2(S, M_1(S)))$. Roughly speaking, the composition is to ‘release all information that is learned by the algorithms.’ Notably, the second algorithm $M_2$ can take as input the output of $M_1$ in addition to the data set $S$. In general, the composition of more than two algorithms follows recursively.

To introduce the composition theorem for $f$-DP, Dong, Roth, et al. (2019) define a binary operation $\otimes$ on trade-off functions. Given trade-off functions $f = T(P, Q)$ and $g = T(P', Q')$, let $f\otimes g = T(P\times P', Q\times Q')$. This definition depends on the distributions $P, Q, P', Q'$ only through $f$ and $g$. Moreover, $\otimes$ is commutative and associative. Now the composition theorem can be readily stated as follows. Let $M_t$ be $f_t$-DP conditionally on any output of the prior algorithms for $t = 1, \ldots, T$. Then their $T$-fold composition algorithm is $f_1\otimes\cdots\otimes f_T$-DP. This result shows that the composition of algorithms in the $f$-DP framework is reduced to performing the $\otimes$ operation on the associated trade-off functions. As an important fact, the privacy bound $f_1\otimes\cdots\otimes f_T$ in general cannot be improved. Put more precisely, one can find an $f_t$-DP mechanism $M_t$ for $t = 1, \ldots, T$ such that their composition is precisely $f_1\otimes\cdots\otimes f_T$-DP (see the discussion following Theorem 4 in Dong, Roth, et al., 2019).

More profoundly, a central limit theorem phenomenon arises in the composition of many ‘very private’ $f$-DP algorithms in the following sense: the trade-off functions of small privacy leakage accumulate to $G_{\mu}$ for some $\mu$ under composition. Informally, assuming each $f_t$ is very close to $\mathrm{Id}(\alpha) = 1 - \alpha$, which corresponds to perfect privacy, then we have


$$(2.2) \qquad f_1\otimes f_2 \otimes \cdots \otimes f_T \ \text{ is approximately } \ G_\mu$$


if the number of iterations $T$ is sufficiently large. This central limit theorem approximation is especially suitable for the privacy analysis of deep learning, where the training process typically takes at least tens of thousands of iterations. The privacy parameter $\mu$ depends on certain functionals, such as the Kullback–Leibler divergence, of the trade-off functions. The central limit theorem yields a very accurate approximation in the settings considered in Section 4 (see numerical confirmation in Appendix A). For a rigorous account of this central limit theorem for differential privacy, see Theorem 6 in Dong, Roth, et al. (2019). We remark that a conceptually related article by Sommer et al. (2018) developed a central limit theorem for privacy loss random variables.

At a high level, this convergence-to-GDP result brings GDP to the focal point of the family of $f$-DP guarantees, implying that GDP is to $f$-DP as normal random variables are to general random variables. Furthermore, this result serves as an effective tool for approximating the privacy guarantees of composition algorithms. In contrast, privacy loss cannot be losslessly tracked under composition in the $(\varepsilon, \delta)$-DP framework.

Next, we turn to the subsampling property of $f$-DP. In training neural networks, the gradient at each iteration is computed from a mini-batch that is subsampled from the training examples. Intuitively, an algorithm applied to a subsample gives stronger privacy guarantees than when applied to the full sample. Looking closely, this privacy amplification is due to the fact that an individual enjoys perfect privacy if not selected in the subsample. A concrete and pressing question is, therefore, to precisely characterize how much privacy is amplified by subsampling in the $f$-DP framework.

Consider the following sampling scheme: for each individual in the data set $S$, include his or her datum in the subsample independently with probability $p$; this is sometimes referred to as Poisson subsampling (see Wang et al., 2019). The resulting subsample is denoted by $\mathtt{Sample}_p(S)$. For the purpose of clearing up any confusion, we remark that the subsample $\mathtt{Sample}_p(S)$ has a random size and, as an intermediate step, is not released. Given any algorithm $M$, denote by $M \circ \mathtt{Sample}_p$ the subsampled algorithm.

The subsampling theorem for $f$-DP states as follows. Let $M$ be $f$-DP, write $f_p$ for $pf+(1-p)\mathrm{Id}$, and denote by $f_p^{-1}$ the inverse of $f_p$. It is proved in Appendix A that the subsampled algorithm $M \circ \mathtt{Sample}_p$ satisfies

$$(2.3) \qquad T\big(M\circ\mathtt{Sample}_p(S),\, M\circ\mathtt{Sample}_p(S')\big)\geqslant f_p$$

if $S$ can be obtained by removing one individual from $S'$. Likewise,

$$T\big(M\circ\mathtt{Sample}_p(S'),\, M\circ\mathtt{Sample}_p(S)\big)\geqslant f_p^{-1}.$$

As such, the two displays above say that the trade-off function of $M\circ\mathtt{Sample}_p$ on any neighboring data sets is lower bounded by $\min\{f_p, f_p^{-1}\}$, which, however, is in general nonconvex and thus is not a trade-off function. This suggests that we can boost the privacy bound by replacing $\min\{f_p, f_p^{-1}\}$ with its double conjugate $\min\{f_p, f_p^{-1}\}^{**}$, which is the greatest convex lower bound of $\min\{f_p, f_p^{-1}\}$ and is indeed a trade-off function. Taken together, all the pieces show that the subsampled algorithm $M\circ\mathtt{Sample}_p$ is $\min\{f_p, f_p^{-1}\}^{**}$-DP.

Notably, the privacy bound $\min\{f_p, f_p^{-1}\}^{**}$ is larger than $f$ and cannot be improved in general. In light of the above, the $f$-DP framework is flexible enough to nicely handle the analysis of privacy amplification by subsampling. In the case where the original algorithm $M$ is $(\varepsilon, \delta)$-DP, this privacy bound strictly improves on the subsampling theorem for $(\varepsilon, \delta)$-DP by Li et al. (2012).
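As a quick numerical illustration of this subsampling operator, the sketch below (ours, assuming $f = G_{1/\sigma}$, the per-step trade-off function used later for NoisySGD) evaluates $f_p = pf + (1-p)\mathrm{Id}$ and its inverse on a grid and takes their pointwise minimum; the final convexification (double conjugate) step is omitted for brevity.

```python
# A minimal numerical sketch of the subsampling bound, assuming f = G_{1/sigma}.
# The paper's final bound additionally replaces this pointwise minimum with its
# convex double conjugate; that step is omitted here for brevity.
import numpy as np
from scipy.stats import norm

def G(alpha, mu):
    """Gaussian trade-off function G_mu."""
    return norm.cdf(norm.ppf(1 - alpha) - mu)

def subsampled_lower_bound(alpha, p, sigma, grid_size=100_001):
    grid = np.linspace(0.0, 1.0, grid_size)
    f_p = p * G(grid, 1.0 / sigma) + (1 - p) * (1 - grid)     # p*f + (1-p)*Id
    # f_p is decreasing on [0, 1], so its inverse is read off by swapping axes.
    f_p_inv = np.interp(grid, f_p[::-1], grid[::-1])
    return np.minimum(np.interp(alpha, grid, f_p), np.interp(alpha, grid, f_p_inv))

# Example with MNIST-like parameters p = 256/60000 and sigma = 0.7.
print(subsampled_lower_bound(np.array([0.05, 0.5]), p=256 / 60000, sigma=0.7))
```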

3. Algorithms and Their Privacy Analyses

3.1. NoisySGD and NoisyAdam

SGD and Adam (Kingma & Ba, 2014) are among the most popular optimizers in deep learning. Here we introduce a new privacy analysis of a private variant of SGD in the $f$-DP framework and then extend the study to a private version of Adam.

Letting $S = \{x_1, \ldots, x_n\}$ denote the data set, we consider minimizing the empirical risk

$$L(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta, x_i),$$

where $\theta$ denotes the weights of the neural networks and $\ell(\theta, x_i)$ is a loss function. At iteration $t$, a mini-batch $I_t$ is selected from $\{1, 2, \ldots, n\}$ with subsampling probability $p$, thereby having an approximate size of $pn$. Taking learning rate $\eta_t$ and initial weights $\theta_0$, the vanilla SGD updates the weights according to

$$\theta_{t+1} = \theta_t - \eta_t \cdot \frac{1}{|I_t|} \sum_{i \in I_t} \nabla_{\theta} \ell(\theta_t, x_i).$$

To preserve privacy, Bassily et al. (2014), Chaudhuri et al. (2011), Song et al. (2013), and Abadi et al. (2016) introduce two modifications to the vanilla SGD. First, a clipping step is applied to the gradient so that the gradient is in effect bounded. This step is necessary to have a finite sensitivity. The second modification is to add Gaussian noise to the clipped gradient, which is equivalent to applying the Gaussian mechanism to the updated iterates. Formally, we present the private SGD algorithm in Algorithm 1, which we henceforth refer to as NoisySGD. Herein $I$ is the identity matrix, $\|\cdot\|_2$ denotes the $\ell_2$ norm, and we abuse notation by using $T$ to denote the number of iterations. As a contribution of the present paper, NoisySGD uses Poisson subsampling, as opposed to the uniform sampling used in Dong, Roth, et al. (2019). For completeness, we remark that there is another possible subsampling method: shuffling (randomly permuting and dividing the data into folds at each epoch). We emphasize that different subsampling mechanisms produce different privacy guarantees.

Algorithm 1 NoisySGD

Input: data set $S = \{x_1,\ldots,x_n\}$, loss function $\ell(\theta, x)$.

Parameters: initial weights $\theta_0$, learning rate $\eta_t$, subsampling probability $p$, number of iterations $T$, noise scale $\sigma$, gradient norm bound $R$.

for $t = 0, \ldots, T-1$ do
  Take a Poisson subsample $I_t \subseteq \{1, \ldots, n\}$ with subsampling probability $p$
  for $i \in I_t$ do
    $v_t^{(i)} \gets \nabla_{\theta} \ell(\theta_t, x_i)$
    $\bar{v}_t^{(i)} \gets v_t^{(i)} / \max\big\{1, \|v_t^{(i)}\|_2/R\big\}$   > Clip gradient
  $\theta_{t+1} \gets \theta_{t} - \eta_t \cdot \frac{1}{|I_t|} \Big(\sum_{i\in I_t}\bar{v}_t^{(i)}+\sigma R \cdot \mathcal{N}(0, I)\Big)$   > Apply Gaussian mechanism

Output: $\theta_T$
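For concreteness, here is a minimal NumPy sketch of a single NoisySGD update on a toy logistic-regression loss; the synthetic data and hyperparameters are illustrative and not taken from the paper's experiments.

```python
# A minimal NumPy sketch of one NoisySGD update (Algorithm 1) on a toy
# logistic-regression loss. Data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
theta = np.zeros(d)

p, eta, sigma, R = 256 / n, 0.15, 1.3, 1.5   # subsampling prob., step size, noise scale, clip bound

def per_example_grad(theta, x_i, y_i):
    """Gradient of the logistic loss at a single example."""
    z = 1.0 / (1.0 + np.exp(-x_i @ theta))
    return (z - y_i) * x_i

# One iteration of Algorithm 1.
batch = np.flatnonzero(rng.random(n) < p)                       # Poisson subsample I_t
grads = np.stack([per_example_grad(theta, X[i], y[i]) for i in batch])
clipped = grads / np.maximum(1.0, np.linalg.norm(grads, axis=1) / R)[:, None]   # clip gradient
noisy_sum = clipped.sum(axis=0) + sigma * R * rng.normal(size=d)                # Gaussian mechanism
theta = theta - eta * noisy_sum / len(batch)
```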

The analysis of the overall privacy guarantee of NoisySGD makes heavy use of the compositional and subsampling properties of $f$-DP. We first focus on the privacy analysis of the step that computes $\theta_{t+1}$ from $\theta_t$. Let $M$ denote the gradient update and write $\mathtt{Sample}_p(S)$ for the mini-batch $I_t$ (we drop the subscript $t$ for simplicity). This allows us to use $M \circ \mathtt{Sample}_p(S)$ to represent what NoisySGD does at each iteration. Next, note that adding or removing one individual would change the value of $\sum_{i\in I_t}\bar{v}_t^{(i)}$ by at most $R$ in the $\ell_2$ norm due to the clipping operation; that is, $\sum_{i\in I_t}\bar{v}_t^{(i)}$ has sensitivity $R$. Consequently, the Gaussian mechanism with noise standard deviation $\sigma R$ ensures that $M$ is $\frac{1}{\sigma}$-GDP. With a few additional arguments, in Appendix B we show that NoisySGD is $\min\{f,f^{-1}\}^{**}$-DP with $f = \big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}$. As a remark, it has been empirically observed in Abadi et al. (2016) and Bagdasaryan et al. (2019) that the performance of the private neural networks is not sensitive to the choice of the clipping norm bound $R$ (see more discussion in Pichapati et al., 2019, and Xu et al., 2019).

To facilitate the use of this privacy bound, we now derive an analytically tractable approximation of $\min\{f,f^{-1}\}^{**}$ using the privacy central limit theorem in a certain asymptotic regime, which further demonstrates the mathematical coherence and versatility of the $f$-DP framework. The central limit theorem shows that, in the asymptotic regime where $p \sqrt{T}\to \nu$ for a constant $\nu > 0$ as $T \to \infty$,

$$f = \big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}\to G_{\mu},$$

where $\mu = \nu\sqrt{\mathrm{e}^{1/\sigma^2}-1}$. Thus, the overall privacy loss in the form of the double conjugate satisfies

$$\min\{f,f^{-1}\}^{**} \approx \min\{G_{\mu}, G_{\mu}^{-1}\}^{**} = G_{\mu}^{**} = G_{\mu}.$$

As such, the central limit theorem demonstrates that NoisySGD is approximately $p\sqrt{T(\mathrm{e}^{1/\sigma^2}-1)}$-GDP. Denoting by $B = pn$ the mini-batch size, the privacy parameter $p\sqrt{T(\mathrm{e}^{1/\sigma^2}-1)}$ equals $\frac{B}{n}\sqrt{T(\mathrm{e}^{1/\sigma^2}-1)}$. Intuitively, this reveals that NoisySGD gives good privacy guarantees if $B\sqrt{T}/n$ is small and $\sigma$ is not too small.
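The following short sketch (ours) evaluates this closed-form GDP parameter; the parameters $\sigma = 0.7$, $p = 256/60000$, and 45 epochs follow the MNIST experiment in Section 4, where the corresponding value is reported as approximately $1.13$.

```python
# A minimal sketch of the closed-form GDP parameter derived above,
# mu = p * sqrt(T * (exp(1/sigma^2) - 1)). The parameters sigma = 0.7,
# p = 256/60000, and 45 epochs follow the MNIST experiment in Section 4.
import math

def gdp_parameter(p, T, sigma):
    """Approximate GDP parameter of NoisySGD via the privacy central limit theorem."""
    return p * math.sqrt(T * (math.exp(1.0 / sigma**2) - 1.0))

p = 256 / 60000                        # subsampling probability (batch size 256, n = 60000)
T = round(45 / p)                      # iterations = epochs / p
print(gdp_parameter(p, T, sigma=0.7))  # ~1.13, matching Table 1
```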

As an aside, we remark that this new privacy analysis is different from the one performed in section 5 of Dong, Roth, et al. (2019). Therein, the authors consider Algorithm 1 with uniform subsampling and obtain a privacy bound that is different from the one in the present article.

Next, we present a private version of Adam in Algorithm 2, which we refer to as NoisyAdam; an implementation can be found in Tensorflow (2019). This algorithm has the same privacy bound as NoisySGD in the $f$-DP framework. In short, this is because the momentum $m_t$ and the past squared gradient $u_t$ are deterministic functions of the noisy gradients, and no additional privacy cost is incurred due to the post-processing property of differential privacy. In passing, we remark that the same argument applies to AdaGrad (Duchi et al., 2011), which is therefore also asymptotically GDP in the same asymptotic regime.

Algorithm 2 NoisyAdam

Input: data set $S = \{x_1,\ldots,x_n\}$, loss function $\ell(\theta, x)$.

Parameters: initial weights $\theta_0$, learning rate $\eta_t$, subsampling probability $p$, number of iterations $T$, noise scale $\sigma$, gradient norm bound $R$, momentum parameters $(\beta_1,\beta_2)$, initial momentum $m_0$, initial past squared gradient $u_0$, and a small constant $\xi > 0$.

for $t = 0, \ldots, T-1$ do
  Take a Poisson subsample $I_t \subseteq \{1, \ldots, n\}$ with subsampling probability $p$
  for $i \in I_t$ do
    $v_t^{(i)} \gets \nabla_{\theta} \ell(\theta_t, x_i)$
    $\bar{v}_t^{(i)} \gets v_t^{(i)} / \max\big\{1, \|v_t^{(i)}\|_2/R\big\}$   > Clip gradient
  $\tilde{v}_t\gets\frac{1}{|I_t|}\big(\sum_{i\in I_t}\bar{v}_t^{(i)}+\sigma R\cdot \mathcal{N}(0,I)\big)$   > Apply Gaussian mechanism
  $m_t\gets\beta_1 m_{t-1}+(1-\beta_1) \tilde{v}_t$
  $u_t\gets\beta_2 u_{t-1}+(1-\beta_2) \big(\tilde{v}_t\odot\tilde{v}_t\big)$   > $\odot$ is the Hadamard product
  $w_t \gets m_t/\big(\sqrt{u_t}+\xi\big)$   > Component-wise division
  $\theta_{t+1} \gets \theta_{t} - \eta_t w_t$

Output: $\theta_T$

3.2. Comparisons With the Moments Accountant

It is instructive to compare the moments accountant with our privacy analysis performed in Section 3.1 using the $f$-DP framework. Developed in Abadi et al. (2016), the moments accountant gives a tight one-to-one mapping between $\varepsilon$ and $\delta$ for specifying the overall privacy loss in terms of $(\varepsilon, \delta)$-DP under composition, which is beyond the reach of the advanced composition theorem by Dwork et al. (2010). In slightly more detail, the moments accountant uses the moment-generating function of the privacy loss random variable to track the privacy loss under composition. With a slight abuse of notation, this article uses the functions $\delta_{\mathtt{MA}} = \delta_{\mathtt{MA}}(\varepsilon)$ and $\varepsilon_{\mathtt{MA}} =\varepsilon_{\mathtt{MA}}(\delta)$ to denote the mapping induced by the moments accountant in both directions. For self-containedness, the appendix includes a formal description of the two functions.

Although NoisySGD and NoisyAdam are our primary focus, the following discussion applies to general iterative algorithms where composition must be addressed in the privacy analysis. Let algorithm $M_t$ be $f_t$-DP for $t = 1, \ldots, T$ and write $M$ for their composition. On the one hand, the moments accountant technique ensures that $M$ is $(\varepsilon, \delta_{\mathtt{MA}}(\varepsilon))$-DP for any $\varepsilon$ or, put equivalently, is $(\varepsilon_{\mathtt{MA}}, \delta)$-DP. On the other hand, the composition algorithm is $f_1 \otimes \cdots \otimes f_T$-DP from the $f$-DP viewpoint and, following from the central limit theorem (2.2), this composition can be shown to be approximately GDP in a certain asymptotic regime. For example, both NoisySGD and NoisyAdam, presented in Algorithm 1 and Algorithm 2, respectively, asymptotically satisfy $\mu_{\mathtt{CLT}}$-GDP with privacy parameter

$$(3.1) \qquad \mu_{\mathtt{CLT}} = p\sqrt{T(\mathrm{e}^{1/\sigma^2} - 1)}.$$

In light of the above, it is tempting to ask which of the two approaches yields a sharper privacy analysis. In terms of $f$-DP guarantees, it must be the latter, which we refer to as the CLT approach, because the composition theorem of $f$-DP is tight and, more importantly, the privacy central limit theorem is asymptotically exact. To formally state the result, note that the moments accountant asserts that the private optimizer is $(\varepsilon, \delta_{\mathtt{MA}}(\varepsilon))$-DP for all $\varepsilon\geqslant 0$, which is equivalent to $\sup_{\varepsilon\geqslant 0} f_{\varepsilon, \delta_{\mathtt{MA}}(\varepsilon)}$-DP by recognizing (2.1) (see also Proposition 2.11 in Dong, Roth, et al., 2019). Roughly speaking, the following theorem says that $\sup_{\varepsilon\geqslant 0} f_{\varepsilon, \delta_{\mathtt{MA}}(\varepsilon)}$-DP (asymptotically) promises no more privacy guarantees than the bound of $\mu_{\mathtt{CLT}}$-GDP given by the CLT approach. This result is summarized by the following theorem; see Appendix B for a formal proof.

Theorem 1 (Comparison in $f$-DP). Assume that $p\sqrt{T}$ converges to a positive constant as $T \to \infty$. Then, both $\mathtt{NoisySGD}$ and $\mathtt{NoisyAdam}$ satisfy

$$\limsup_{T \to \infty} \left( \sup_{\varepsilon\geqslant 0} f_{\varepsilon, \delta_{\mathtt{MA}}(\varepsilon)}(\alpha) - G_{\mu_{\mathtt{CLT}}}(\alpha) \right) \leqslant 0$$

for every $0 \leqslant \alpha \leqslant 1$.

Remark 1. For ease of reading, we point out that, in the $(\varepsilon,\delta)$-DP framework, the smaller $\varepsilon, \delta$ are, the more privacy is guaranteed. In contrast, in the $f$-DP framework, the smaller $f$ is, the less privacy is guaranteed.

From the $(\varepsilon, \delta)$-DP viewpoint, however, the question as to which approach gives a sharper privacy analysis is presently unclear. Explicitly, the duality between $f$-DP and $(\varepsilon, \delta)$-DP shows that $\mu$-GDP implies $(\varepsilon, \delta(\varepsilon; \mu))$-DP for all $\varepsilon\geqslant 0$, where

$$(3.2) \qquad \delta(\varepsilon; \mu) = 1 + G^*_{\mu}(-\mathrm{e}^{\varepsilon}) = \Phi\Big( -\frac{\varepsilon}{\mu} +\frac{\mu}{2}\Big)- \mathrm{e}^{\varepsilon}\Phi\Big(- \frac{\varepsilon}{\mu} - \frac{\mu}{2}\Big).$$
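A minimal sketch (ours) of this duality formula, converting a GDP guarantee into an $(\varepsilon, \delta)$-DP guarantee; the check against the third row of Table 1 ($\mu = 1.13$, $\varepsilon \approx 5.07$ at $\delta = 10^{-5}$) is approximate because both reported values are rounded.

```python
# A minimal sketch of the duality formula (3.2): delta(epsilon; mu) for mu-GDP.
import math
from scipy.stats import norm

def delta_from_gdp(epsilon, mu):
    """delta(eps; mu) = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2)."""
    return (norm.cdf(-epsilon / mu + mu / 2)
            - math.exp(epsilon) * norm.cdf(-epsilon / mu - mu / 2))

# Rough check against Table 1: mu = 1.13 corresponds to eps ~ 5.07 at delta ~ 1e-5.
print(delta_from_gdp(5.07, 1.13))  # close to 1e-5 (up to rounding of mu and eps)
```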

Answering the question is, therefore, reduced to the comparison between $\delta_{\mathtt{MA}}(\varepsilon)$ and $\delta_{\mathtt{CLT}}(\varepsilon) := \delta(\varepsilon; \mu_{\mathtt{CLT}})$ or, equivalently, between $\varepsilon_{\mathtt{MA}}(\delta)$ and $\varepsilon_{\mathtt{CLT}}(\delta) := \varepsilon(\delta; \mu_{\mathtt{CLT}})$.

Theorem 2 (Comparison in $(\varepsilon,\delta)$-DP). Under the assumptions of Theorem 1, the $f$-DP framework gives an asymptotically sharper privacy analysis of both $\mathtt{NoisySGD}$ and $\mathtt{NoisyAdam}$ than the moments accountant in terms of $(\varepsilon, \delta)$-DP. That is,

$$\limsup_{T \to \infty} \, \left( \delta_{\mathtt{CLT}}(\varepsilon) - \delta_{\mathtt{MA}}(\varepsilon) \right) < 0$$

for all $\varepsilon\geqslant 0$.

In words, the CLT approach in the $f$-DP framework allows for an asymptotically smaller $\delta$ than the moments accountant at the same $\varepsilon$. It is worthwhile mentioning that the inequality in this theorem holds for any finite $T$ if $\delta$ is derived by directly applying the duality to the (exact) privacy bound $f_1 \otimes \cdots \otimes f_T$. Equivalently, the theorem says that $\limsup_{T \to \infty} \, \left( \varepsilon_{\mathtt{CLT}}(\delta) - \varepsilon_{\mathtt{MA}}(\delta) \right) < 0$ for any $\delta$. As such, by setting the same $\delta$ in both approaches, say $\delta = 10^{-5}$, the $f$-DP based CLT approach shall give a smaller value of $\varepsilon$.

From a practical viewpoint, this refined privacy analysis allows us to trade privacy guarantees for improvement in utility. More precisely, recognizing the aforementioned conclusion that $\delta(\varepsilon; \mu_{\mathtt{CLT}}) \equiv \delta_{\mathtt{CLT}}(\varepsilon) < \delta_{\mathtt{MA}}(\varepsilon)$ (for sufficiently large $T$) and that $\delta(\varepsilon; \mu)$ increases as $\mu$ increases, one can find $\tilde\mu_{\mathtt{CLT}} > \mu_{\mathtt{CLT}}$ such that

$$(3.3) \qquad \delta(\varepsilon; \tilde\mu_{\mathtt{CLT}}) = \delta_{\mathtt{MA}}(\varepsilon).$$

Put differently, we can carefully adjust some parameters in Algorithm 1 and Algorithm 2 in order to let the algorithms be $\tilde\mu_{\mathtt{CLT}}$-GDP. For example, we can reduce the scale of the added noise from $\sigma$ to a certain $\tilde\sigma < \sigma$, which can be solved from (3.3) and

$$(3.4) \qquad \tilde\mu_{\mathtt{CLT}} = p\sqrt{T(\mathrm{e}^{1/\tilde\sigma^2} - 1)}.$$

Note that this is adapted from (3.1).
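The two displays above can be combined into a small numerical routine. The sketch below (ours) solves (3.3) for $\tilde\mu_{\mathtt{CLT}}$ via the duality formula (3.2) and then inverts (3.4) for $\tilde\sigma$; the example parameters follow the top panel of Figure 4 in Section 4.2 ($p = 256/60000$, 20 epochs, target $(1.34, 10^{-5})$-DP), for which the reported value is $\tilde\sigma \approx 1.06$.

```python
# A minimal sketch of the noise-reduction step: solve (3.3) for mu_tilde using the
# duality formula (3.2), then invert (3.4) for the reduced noise scale sigma_tilde.
# The example parameters follow the top panel of Figure 4 (Section 4.2).
import math
from scipy.optimize import brentq
from scipy.stats import norm

def delta_from_gdp(epsilon, mu):                 # duality formula (3.2)
    return (norm.cdf(-epsilon / mu + mu / 2)
            - math.exp(epsilon) * norm.cdf(-epsilon / mu - mu / 2))

def reduced_sigma(epsilon, delta_target, p, T):
    # Step 1: mu_tilde with delta(epsilon; mu_tilde) = delta_target, cf. (3.3).
    mu_tilde = brentq(lambda mu: delta_from_gdp(epsilon, mu) - delta_target, 1e-4, 20.0)
    # Step 2: invert mu_tilde = p * sqrt(T * (exp(1/sigma^2) - 1)), cf. (3.4).
    return 1.0 / math.sqrt(math.log(1.0 + (mu_tilde / (p * math.sqrt(T))) ** 2))

p = 256 / 60000
T = round(20 / p)                                # 20 epochs
print(reduced_sigma(epsilon=1.34, delta_target=1e-5, p=p, T=T))  # ~1.06
```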

Figure 1. An illustration of the CLT approach in the $f$-DP framework and the moments accountant in the $(\varepsilon, \delta)$-DP framework. NoisyOptimizer($\sigma$, ...), using the moments accountant, gives the same privacy guarantees in terms of $(\varepsilon, \delta)$-DP as NoisyOptimizer($\tilde\sigma$, ...) using the CLT approach (the ellipses denote omitted parameters). Note that the duality formula (3.2) is used in solving $\tilde\mu_{\mathtt{CLT}}$ from (3.3).

Figure 1 shows the flowchart of the privacy analyses using the two approaches and their relationship. In addition, numerical comparisons are presented in Figure 2, consistently demonstrating the superiority of the CLT approach.

Figure 2. Tradeoffs between ε and δ for both CLT and MA, which henceforth denotes the moments accountant. The settings follow the MNIST experiment in Section 4 with σ = 0.7 and p = 256/60000. The bottom two plots assume δ = 10^{-5}. Note that ε and δ in the CLT are related via (3.2) with μ = μ_CLT. The bottom right plot is consistent with the conclusion σ > σ̃ shown in the cloud icon of Figure 1.

4. Results

In this section, we use NoisySGD and NoisyAdam to train private deep learning models on data sets for tasks ranging from image classification (MNIST) and text classification (IMDb movie review) to recommender systems (MovieLens movie rating) and regular binary classification (Adult income). Note that these data sets all contain sensitive information about individuals, and this fact necessitates privacy consideration in the training process. Code to reproduce the results using the TensorFlow Privacy library is available at https://github.com/tensorflow/privacy/tree/master/research/GDP_2019.

4.1. The $f$-DP Perspective

This section demonstrates the utility and practicality of the private deep learning methodologies with associated privacy guarantees in terms of $f$-DP. In Section 4.2, we extend the empirical study to the $(\varepsilon, \delta)$-DP framework. Throughout the experiments, the parameter $\delta$ we use always satisfies $\delta < 1/n$, where $n$ is the number of training examples.

MNIST.

The MNIST data set, created by LeCun & Cortes (2010), contains 60,000 training images and 10,000 test images. Each image is a $28\times 28$ gray-scale representation of a handwritten digit ranging from 0 to 9. We train neural networks with the same architecture (two convolutional layers followed by one dense layer) as in Tensorflow (2019) and Abadi et al. (2016) on this data set. Throughout the experiment, we set the subsampling probability to $p=256/60000$ and use a constant learning rate $\eta$.

Table 1 displays the test accuracy of the neural networks trained by NoisySGD as well as the associated privacy analyses. The privacy parameters $\varepsilon$ in the last two columns are both with respect to $\delta=10^{-5}$. Over all six sets of experiments with different tuning parameters, the CLT approach gives a significantly smaller value of $\varepsilon$ than the moments accountant, which is consistent with our discussion in Section 3.2. The point we wish to emphasize, however, is that $f$-DP offers a much more comprehensive interpretation of the privacy guarantees than $(\varepsilon, \delta)$-DP. For instance, the model from the third row preserves a decent amount of privacy since it is not always easy to distinguish $\mathcal{N}(0,1)$ and $\mathcal{N}(1.13, 1)$. In stark contrast, the $(\varepsilon,\delta)$-DP viewpoint is too conservative, suggesting that for the same model not much privacy is left, due to a very large ‘likelihood ratio’ $\mathrm{e}^{\varepsilon}$ in Definition 2.1: it equals $\mathrm{e}^{7.10} = 1212.0$ or $\mathrm{e}^{5.07} = 159.1$ depending on which approach is chosen. This shortcoming of $(\varepsilon, \delta)$-DP cannot be overcome by taking a larger $\delta$, which, although giving rise to a smaller $\varepsilon$, would undermine the privacy guarantee from a different perspective.

Table 1. Experimental Results for NoisySGD and Their Privacy Analyses on MNIST. Note. The accuracy is averaged over 10 independent runs. The hyperparameters in the first three rows are the same as in Tensorflow (2019). The $\mu$ in the 6th column is calculated using (3.1), which carries over to the 7th column via (3.2) with $\delta=10^{-5}$. The number of epochs is equal to $T \times \text{mini-batch size}/n = pT$.

η      R     σ      Epochs   Test accuracy (%)   CLT μ   CLT ε   MA ε
0.25   1.5   1.3    15       95.0                0.23    0.83    1.19
0.15   1.0   1.1    60       96.6                0.57    2.32    3.01
0.25   1.5   0.7    45       97.0                1.13    5.07    7.10
0.25   1.5   0.6    62       97.6                2.00    9.98    13.27
0.25   1.5   0.55   68       97.8                2.76    14.98   18.72
0.25   1.5   0.5    100      98.0                4.78    31.12   32.40

For all experiments described in Table 1, Figure 3 illustrates the privacy bounds given by the CLT approach and the moments accountant, both in terms of trade-off functions. The six plots in the first and third rows are with respect to $\delta = 10^{-5}$, from which the $f$-DP framework is seen to provide an analyst with substantial improvements in the privacy bounds. Note that the first row in Figure 3 corresponds to the first three rows in Table 1, and the third row in Figure 3 corresponds to the last three rows in Table 1. For the model corresponding to 96.6% test accuracy, concretely, the minimum sum of type I and type II errors in the sense of hypothesis testing is (at least) 77.6% by the CLT approach, whereas it is merely (at least) 9.4% by the moments accountant. For completeness, we show the optimal trade-off functions over all pairs of $\varepsilon, \delta$ given by the moments accountant in the middle row. The gaps between the two approaches exist, as predicted by Theorem 1, and remain significant.
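The two numbers quoted above can be reproduced with a short grid search (ours); the trade-off functions being minimized are $G_{0.57}$ for the CLT analysis and $f_{3.01, 10^{-5}}$ from (2.1) for the moments accountant, matching the second row of Table 1.

```python
# A minimal numerical check of the minimum total error (type I + type II)
# quoted above for the 96.6%-accuracy model (second row of Table 1).
import numpy as np
from scipy.stats import norm

alpha = np.linspace(0.0, 1.0, 100_001)

def G(a, mu):                        # Gaussian trade-off function
    return norm.cdf(norm.ppf(1 - a) - mu)

def f_eps_delta(a, eps, delta):      # trade-off function of (eps, delta)-DP, eq. (2.1)
    return np.maximum.reduce([np.zeros_like(a),
                              1 - delta - np.exp(eps) * a,
                              np.exp(-eps) * (1 - delta - a)])

print(np.min(alpha + G(alpha, 0.57)))                   # ~0.776 (CLT bound)
print(np.min(alpha + f_eps_delta(alpha, 3.01, 1e-5)))   # ~0.094 (MA bound)
```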

Figure 3. Comparisons between the two ways of privacy analysis on MNIST in terms of the trade-off between type I and type II errors, in the same setting as Table 1. The plots are different from Figure 7 in Dong, Roth, et al. (2019). The $(\varepsilon, \delta)$-DP guarantees are plotted according to (2.1). The blue regions in the plots from the second row correspond to all pairs of $(\varepsilon,\delta)$ computed by MA. The blue regions are not noticeable in the third row.

Next, we extend our experiments to other data sets to further test $f$-DP for training private neural networks. The experiments compare private models under the privacy budget $\mu \leqslant 2$ to their non-private counterparts and some popular baseline methods. For simplicity, we focus on shallow neural networks and leave the investigation of complex architectures for future research.

Adult income.

Originally from the UCI repository in Dua and Graff (2017), the Adult income data set has been preprocessed into the LIBSVM format by Chang and Lin (2011). This data set contains 32,561 examples, each of which has 123 features and a label indicating whether the individual’s annual income is more than $50,000 or not. We randomly choose 10% of the examples as the test set (3,256 examples) and use the remaining 29,305 examples as the training set.

Our model is a multilayer perceptron with a single hidden layer of 16 neurons and the ReLU activation. We set $\sigma=0.55$, $p=256/29305$, $\eta=0.15$, $R=1$, and use NoisySGD as our optimizer. The results displayed in Table 2 show that our private model achieves performance comparable to the baselines in the MLC++ library by Kohavi et al. (1996) in terms of test accuracy.

Table 2. Results for NoisySGD on the Adult Income Data Set. Note. The $\varepsilon$ parameters are with respect to $\delta=10^{-5}$.

Models                        Epochs   Test accuracy (%)   CLT μ   CLT ε   MA ε
private networks              18       84.0                2.03    10.20   14.70
non-private networks          20       84.5                –       –       –
kNN (Cover & Hart, 1967)      –        79.7                –       –       –
naive Bayes                   –        83.9                –       –       –
voted ID3 (Quinlan, 1986)     –        84.4                –       –       –
C4.5 (Quinlan, 2014)          –        84.5                –       –       –

IMDb.

We use the IMDb movie review data set from Maas et al. (2011) for binary sentiment classification (positive or negative reviews). The data set contains 25,000 training and 25,000 test examples. In our experiments, we preprocess the data set by keeping only the 10,000 most frequently used words and discarding the rest. Next, we set every example to have 256 words by truncating the length or filling with zeros if necessary.

In our neural networks, the input is first embedded into 16 units and then passed through a global average pooling layer. The intermediate output is fed to a fully-connected layer with 16 neurons, followed by a ReLU layer. We set $\sigma=0.56$, $p=512/25000$, $\eta=0.02$, $R=1$, and use NoisyAdam as our optimizer, which is observed to converge much faster than NoisySGD in this training task. We use the (non-private) two-layer LSTM RNN model in the Tensorflow tutorials as a baseline model. Table 3 reports the experimental results. Notably, the private neural networks perform comparably to the baseline model, at a cost of only a one percent drop in test accuracy compared to the non-private counterpart.

Table 3. Results for NoisyAdam on the IMDb Data Set, With $\delta=10^{-5}$ Used in the Privacy Analyses.

Models                                      Epochs   Test accuracy (%)   CLT μ   CLT ε   MA ε
private networks                            9        83.8                2.07    10.43   15.24
non-private networks                        20       84.7                –       –       –
LSTM-RNN (Hochreiter & Schmidhuber, 1997)   10       85.4                –       –       –

MovieLens.

The MovieLens movie rating data set from Harper and Konstan (2016) is a benchmark data set for recommendation tasks. Our experiments consider the MovieLens 1M data set, which contains 1,000,209 movie ratings from 1 star to 5 stars. In total, there are 6,040 users who rated 3,706 different movies. For this multi-class classification problem, the root mean squared error (RMSE) is chosen as the performance measure. It is worthwhile to mention that, as each user watched only a small fraction of all the movies, most (user, movie) pairs correspond to missing ratings. We randomly sample 20% of the examples as the test set and take the remainder as the training set.

Our model is a simplified version of the neural collaborative filtering approach in He et al. (2017). The network architecture consists of two branches. The left branch applies generalized matrix factorization to embed the users and movies using five latent factors, and the output of the user embedding is multiplied by the item embedding. In the right branch, we use 10 latent factors for the embeddings. The embeddings from both branches are then concatenated and fed to a fully connected output layer. We set $\sigma=0.6$, $p=1/80$, $\eta=0.01$, and $R=5$ in NoisyAdam.

Table 4. Results for NoisyAdam on the MovieLens 1M Data Set, With $\delta=10^{-6}$ Used in the Privacy Analyses. Note. CF stands for collaborative filtering.

Models                                Epochs   RMSE    CLT μ   CLT ε   MA ε
private networks                      20       0.915   1.94    10.61   15.39
non-private networks                  20       0.893   –       –       –
SVD                                   –        0.873   –       –       –
NMF                                   –        0.916   –       –       –
user-based CF (Sarwar et al., 2001)   –        0.923   –       –       –
global average                        –        1.117   –       –       –

Table 4 presents the numerical results of our neural networks as well as baseline models in the Surprise library from Hug (2017) in their default settings. The difference in RMSE between the non-private networks and the private one is relatively large for the MovieLens 1M data set. Nevertheless, the private model still outperforms many popular non-private models, including the user-based collaborative filtering and nonnegative matrix factorization.

4.2. The $(\varepsilon, \delta)$-DP Perspective

While we hope that the $f$-DP perspective has been conclusively demonstrated to be advantageous, this section shows that the CLT approach continues to bring considerable benefits even in terms of $(\varepsilon, \delta)$-DP. Specifically, by making use of the comparisons between the CLT approach and the moments accountant in Section 3.2, we can add less noise to the gradients in NoisySGD and NoisyAdam while achieving the same $(\varepsilon, \delta)$-DP guarantees provided by the moments accountant. With less added noise, conceivably, an optimizer would have a higher prediction accuracy.

Figure 4 illustrates the experimental results on MNIST. In the top two plots, we set the noise scales to $\sigma=1.3$ and $\tilde\sigma=1.06$, which are both shown to give $(1.34, 10^{-5})$-DP at epoch 20 using the moments accountant and the CLT approach, respectively. The test accuracy associated with the CLT approach is almost always higher than that associated with the moments accountant. In addition, another benefit of taking the CLT approach is that it gives rise to stronger privacy protection before reaching epoch 20, as shown by the right plot. For the bottom plots, although the improvement in test accuracy at the end of training is less significant, the CLT approach leads to much faster convergence at early epochs. To be concrete, the numbers of epochs needed to achieve 95%, 96%, and 97% test accuracy are 18, 26, and 45, respectively, for the neural networks with less noise, whereas the numbers of epochs are 23, 33, and 64, respectively, using a noise level that is computed by the moments accountant. In a similar vein, the moments accountant gives a test accuracy of 92% for the first time when $\varepsilon = 4$, whereas the CLT approach achieves 96% under the same privacy budget.

Figure 4. Experimental results from one run of NoisySGD on MNIST with different noise scales but the same (ε, δ)-DP guarantees. The top plots use p = 256/60000, η = 0.15, R = 1.5, and σ = 1.3, σ̃ = 1.06. The CLT approach with σ̃ = 1.06 and the moments accountant with σ = 1.3 give (1.34, 10^{-5})-DP at the 20th epoch (μ_CLT = 0.35). The bottom plots use the same parameters except for σ = 0.7, σ̃ = 0.638, and η = 0.15. Both approaches give (8.68, 10^{-5})-DP at epoch 70 (μ_CLT = 1.78). The right plots show the privacy loss during the training process in terms of the ε spending with respect to δ = 10^{-5}.

5. Discussion

In this article, we have showcased the use of $f$-DP, a very recently proposed privacy definition, for training private deep learning models using SGD or Adam. Owing to its strength in handling composition and subsampling and the powerful privacy central limit theorem, the $f$-DP framework allows for a closed-form privacy bound that is sharper than the one given by the moments accountant in the $(\varepsilon, \delta)$-DP framework. Employing numerical experiments, we show that the trained neural networks can be quite private from the $f$-DP viewpoint (for instance, $1.13$-GDP) but are not in the $(\varepsilon, \delta)$-DP sense due to overly conservative privacy bounds (for instance, $(7.10, 10^{-5})$-DP) computed in the $(\varepsilon, \delta)$-DP framework. This in turn suggests that one can add less noise during the training process while having the same privacy guarantees as using the moments accountant, thereby improving model utility.

We conclude this article by offering several directions for future research. As the first direction, we may consider using time-dependent noise scales and learning rates in NoisySGD and NoisyAdam for a better tradeoff between privacy loss and utility in the $f$-DP framework. Note that Lee and Kifer (2018) made considerable progress along this line using concentrated differential privacy. More generally, a straightforward but interesting problem is to extend this work to complex neural network architectures with a variety of optimization strategies. For example, can we develop some guidelines for choosing an optimizer among NoisySGD, NoisyAdam, and others for a given classification problem under some privacy constraint? Empirically, deep learning models are very sensitive to hyperparameters such as mini-batch size in terms of test accuracy. Therefore, from a practical standpoint, it would be of great importance to incorporate hyperparameter tuning into the $f$-DP framework (see Gupta et al., 2010). Inspired by Lecuyer et al. (2019), another interesting direction is to explore the possible relationship between $f$-DP guarantees and adversarial robustness of neural networks. Given $f$-DP’s good interpretability and powerful toolbox, it is worthwhile investigating whether, from a broad perspective, its superiority over earlier differential privacy relaxations would hold in general private statistical and machine learning tasks. We look forward to more research efforts to further the theory and extend the use of $f$-DP.


Disclosure Statement

The authors have no conflicts of interest to declare.

Acknowledgments

We are grateful to David Durfee, Ryan Rogers, Aaron Roth, and Qinqing Zheng for stimulating discussions in the early stages of this work. We would also like to thank two anonymous referees for their constructive comments that improved the presentation of the paper. This work was supported in part by NSF through CAREER DMS-1847415, CCF-1763314, and CCF-1934876, the Wharton Dean’s Research Fund, and NIH through R01GM124111 and RF1AG063481.


References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318. https://dl.acm.org/doi/abs/10.1145/2976749.2978318

Abowd, J. M. (2018). The US Census Bureau adopts differential privacy. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2867–2867. https://dl.acm.org/doi/abs/10.1145/3219819.3226070

Bagdasaryan, E., Poursaeed, O., & Shmatikov, V. (2019). Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/9681-differential-privacy-has-disparate-impact-on-model-accuracy

Balle, B., Barthe, G., & Gaboardi, M. (2018). Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in Neural Information Processing Systems, 6280–6290. http://papers.nips.cc/paper/7865-privacy-amplification-by-subsampling-tight-analyses-via-couplings-and-divergences

Balle, B., Barthe, G., Gaboardi, M., Hsu, J., & Sato, T. (2019). Hypothesis testing interpretations and Renyi differential privacy. ArXiv Preprint ArXiv:1905.09982.

Bassily, R., Smith, A., & Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 464–473. https://ieeexplore.ieee.org/abstract/document/6979031

Bun, M., Dwork, C., Rothblum, G. N., & Steinke, T. (2018). Composable and versatile privacy via truncated CDP. Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 74–86. https://dl.acm.org/doi/abs/10.1145/3188745.3188946

Bun, M., & Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. Theory of Cryptography Conference, 635–658. https://dl.acm.org/citation.cfm?id=3081298

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27. https://dl.acm.org/doi/abs/10.1145/1961189.1961199

Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar), 1069–1109. http://www.jmlr.org/papers/volume12/chaudhuri11a/chaudhuri11a.pdf

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. https://ieeexplore.ieee.org/abstract/document/1053964/

Differential-Privacy-Team. (2017). Learning with privacy at scale. Apple. https://machinelearning.apple.com/research/learning-with-privacy-at-scale

Ding, B., Kulkarni, J., & Yekhanin, S. (2017). Collecting telemetry data privately. Proceedings of Advances in Neural Information Processing Systems 30 (NIPS 2017). http://papers.nips.cc/paper/6948-collecting-telemetry-data-privately

Dong, J., Durfee, D., & Rogers, R. (2019). Optimal differential privacy composition for exponential mechanisms and the cost of adaptivity. ArXiv Preprint ArXiv:1909.13830.

Dong, J., Roth, A., & Su, W. J. (2019). Gaussian differential privacy. ArXiv Preprint ArXiv:1905.02383.

Dua, D., & Graff, C. (2017). UCI machine learning repository. University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159. http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. (2006). Our data, ourselves: Privacy via distributed noise generation. Annual International Conference on the Theory and Applications of Cryptographic Techniques, 486–503. https://link.springer.com/chapter/10.1007/11761679_29

Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. Theory of Cryptography Conference, 265–284. https://link.springer.com/chapter/10.1007/11681878_14

Dwork, C., & Rothblum, G. N. (2016). Concentrated differential privacy. ArXiv Preprint ArXiv:1603.01887.

Dwork, C., Rothblum, G. N., & Vadhan, S. (2010). Boosting and differential privacy. 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), 51–60. https://ieeexplore.ieee.org/abstract/document/5670947/

Erlingsson, Ú., Pihur, V., & Korolova, A. (2014). RAPPOR: Randomized aggregatable privacy-preserving ordinal response. Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 1054–1067. https://dl.acm.org/doi/abs/10.1145/2660267.2660348

Gupta, A., Ligett, K., McSherry, F., Roth, A., & Talwar, K. (2010). Differentially private combinatorial optimization. Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, 1106–1125. https://epubs.siam.org/doi/abs/10.1137/1.9781611973075.90

Harper, F. M., & Konstan, J. A. (2016). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS), 5(4), 19. https://dl.acm.org/doi/abs/10.1145/2827872

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. Proceedings of the 26th International Conference on World Wide Web, 173–182. https://dl.acm.org/doi/abs/10.1145/3038912.3052569

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735

Hug, N. (2017). Surprise, a Python library for recommender systems. http://surpriselib.com/

Kairouz, P., Oh, S., & Viswanath, P. (2017). The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6), 4037–4049. http://proceedings.mlr.press/v37/kairouz15.html

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. ArXiv Preprint ArXiv:1412.6980.

Kohavi, R., Sommerfield, D., & Dougherty, J. (1996). Data mining using MLC++, a machine learning library in C++. Proceedings of the Eighth IEEE International Conference on Tools with Artificial Intelligence, 234–245. https://ieeexplore.ieee.org/abstract/document/560457/

LeCun, Y., & Cortes, C. (2010). MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/

Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., & Jana, S. (2019). Certified robustness to adversarial examples with differential privacy. 2019 IEEE Symposium on Security and Privacy (SP), 656–672. https://ieeexplore.ieee.org/abstract/document/8835364/

Lee, J., & Kifer, D. (2018). Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1656–1665. https://dl.acm.org/doi/abs/10.1145/3219819.3220076

Li, N., Qardaji, W., & Su, D. (2012). On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, 32–33. https://dl.acm.org/doi/abs/10.1145/2414456.2414474

Liu, C., He, X., Chanyaswad, T., Wang, S., & Mittal, P. (2019). Investigating statistical privacy frameworks from the perspective of hypothesis testing. Proceedings on Privacy Enhancing Technologies, 2019(3), 233–254. https://content.sciendo.com/view/journals/popets/2019/3/article-p233.xml

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–150. https://www.aclweb.org/anthology/P11-1015.pdf

McMahan, B., Andrew, G., Mironov, I., Papernot, N., Kairouz, P., Chien, S., & Erlingsson, Ú. (2018). A general approach to adding differential privacy to iterative training procedures. ArXiv Preprint ArXiv:1812.06210.

McMahan, B., Ramage, D., Talwar, K., & Zhang, L. (2017). Learning differentially private recurrent language models. ArXiv Preprint ArXiv:1710.06963.

Mironov, I. (2017). Rényi differential privacy. 2017 IEEE 30th Computer Security Foundations Symposium (CSF), 263–275. https://ieeexplore.ieee.org/abstract/document/8049725

Murtagh, J., & Vadhan, S. (2016). The complexity of computing the optimal composition of differential privacy. Theory of Cryptography Conference, 157–175. https://link.springer.com/chapter/10.1007/978-3-662-49096-9_7

Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. 2008 IEEE Symposium on Security and Privacy, 111–125. https://ieeexplore.ieee.org/abstract/document/4531148/

Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., & Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. ArXiv Preprint ArXiv:1610.05755.

Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., & Erlingsson, Ú. (2018). Scalable private learning with PATE. ArXiv Preprint ArXiv:1802.08908.

Pichapati, V., Suresh, A. T., Yu, F. X., Reddi, S. J., & Kumar, S. (2019). AdaCliP: Adaptive clipping for private SGD. ArXiv Preprint ArXiv:1908.07643.

Quinlan, R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://link.springer.com/article/10.1007/BF00116251

Quinlan, R. (2014). C4.5: Programs for machine learning. Elsevier.

Rockafellar, R. T. (1970). Convex analysis (Vol. 28). Princeton University Press.

Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th International Conference on World Wide Web (WWW), 285–295. https://dl.acm.org/doi/pdf/10.1145/371920.372071

Shokri, R., & Shmatikov, V. (2015). Privacy-preserving deep learning. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1310–1321. https://dl.acm.org/doi/abs/10.1145/2810103.2813687

Sommer, D., Meiser, S., & Mohammadi, E. (2018). Privacy loss classes: The Central Limit Theorem in differential privacy. Proceedings on Privacy Enhancing Technologies, 2019(2), 245–269. https://content.sciendo.com/view/journals/popets/2019/2/article-p245.xml

Song, S., Chaudhuri, K., & Sarwate, A. D. (2013). Stochastic gradient descent with differentially private updates. 2013 IEEE Global Conference on Signal and Information Processing, 245–248. https://ieeexplore.ieee.org/abstract/document/6736861/

Sweeney, L. (1997). Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2–3), 98–110. https://journals.sagepub.com/doi/abs/10.1111/j.1748-720x.1997.tb01885.x

TensorFlow. (2019). TensorFlow Privacy library. http://github.com/tensorflow/privacy

Wang, Y.-X., Balle, B., & Kasiviswanathan, S. P. (2019). Subsampled Renyi differential privacy and analytical moments accountant. The 22nd International Conference on Artificial Intelligence and Statistics, 1226–1235. http://proceedings.mlr.press/v89/wang19b.html

Wasserman, L., & Zhou, S. (2010). A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489), 375–389. https://amstat.tandfonline.com/doi/abs/10.1198/jasa.2009.tm08651

Xiang, L., Yang, J., & Li, B. (2019). Differentially-private deep learning from an optimization perspective. IEEE INFOCOM 2019-IEEE Conference on Computer Communications, 559–567. https://ieeexplore.ieee.org/abstract/document/8737494/

Xu, Z., Shi, S., Liu, A. X., Zhao, J., & Chen, L. (2019). An adaptive and fast convergent approach to differentially private deep learning. ArXiv Preprint ArXiv:1912.09150.

Zhang, J., Zheng, K., Mou, W., & Wang, L. (2017). Efficient private ERM for smooth objectives. Proceedings of the 26th International Joint Conference on Artificial Intelligence, 3922–3928. https://www.ijcai.org/Proceedings/2017/0548.pdf


Appendix A. Omitted Details in Section 2

We present the inequality (2.3) as the following proposition, which is given in Section 2 but not in the foundational work of Dong, Roth, et al. (2019).

Proposition A.1. If $M$ is $f$-DP, and $S'=S\cup\{x_0\}$, then

T\big(M\circ \mathtt{Sample}_p(S), M\circ \mathtt{Sample}_p(S')\big)\geqslant pf+(1-p)\mathrm{Id}.

Proof of Proposition A.1. We first write the two distributions $M\circ \mathtt{Sample}_p(S)$ and $M\circ \mathtt{Sample}_p(S')$ as mixtures.

Without loss of generality, we can assume $S=\{x_1,\ldots, x_n\}$ and $S'=\{x_0,x_1,\ldots, x_n\}$. An outcome of the process $\mathtt{Sample}_p$ when applied to $S$ is a bit string $\vec{b} = (b_1,\ldots, b_n)\in \{0,1\}^n$, where the bit $b_i$ indicates whether $x_i$ is selected into the subsample. We use $S_{\vec{b}}\subseteq S$ to denote the subsample determined by $\vec{b}$. When each $b_i$ is sampled independently from a Bernoulli$(p)$ distribution, $S_{\vec{b}}$ can be identified with $\mathtt{Sample}_p(S)$. Let $\theta_{\vec{b}}$ be the probability that $\vec{b}$ appears. More specifically, if $k$ out of the $n$ entries of $\vec{b}$ are one, then $\theta_{\vec{b}}=p^k(1-p)^{n-k}$. With this notation, $M\circ \mathtt{Sample}_p(S)$ can be written as the following mixture:

M\circ \mathtt{Sample}_p(S) = \sum_{\vec{b}\in\{0,1\}^n} \theta_{\vec{b}}\cdot M(S_{\vec{b}}).

Similarly, $M\circ \mathtt{Sample}_p(S')$ can also be written as a mixture, with an additional bit indicating the presence of $x_0$. Alternatively, we can divide the components into two groups: one with $x_0$ present, and the other with $x_0$ absent. Namely,

M\circ \mathtt{Sample}_p(S') = \sum_{\vec{b}\in\{0,1\}^{n}} p\cdot\theta_{\vec{b}}\cdot M(S_{\vec{b}}\cup \{x_0\})+\sum_{\vec{b}\in\{0,1\}^{n}} (1-p)\cdot\theta_{\vec{b}}\cdot M(S_{\vec{b}}).

Note that $S_{\vec{b}}\cup \{x_0\}$ and $S_{\vec{b}}$ are neighbors, i.e., $M\circ \mathtt{Sample}_p(S')$ is a mixture over neighboring data sets. The following lemma is the perfect tool for dealing with such mixtures.

Lemma A.2. Let $I$ be an index set. For each $i\in I$, let $P_i$ and $Q_i$ be distributions that reside on a common sample space, and let $(\theta_i)_{i\in I}$ be a collection of non-negative numbers that sum to 1. If $f$ is a trade-off function and $T(P_i,Q_i)\geqslant f$ for all $i$, then

T\big(\sum \theta_i P_i, (1-p)\sum \theta_i P_i +p\sum \theta_i Q_i\big)\geqslant p f+ (1-p)\mathrm{Id}.

To apply the lemma, let the index be $\vec{b}\in\{0,1\}^n$, let $P_i$ be $M(S_{\vec{b}})$, and let $Q_i$ be $M(S_{\vec{b}}\cup \{x_0\})$. The condition $T(P_i,Q_i)\geqslant f$ is a consequence of $M$ being $f$-DP. The conclusion translates to

T\big(M\circ \mathtt{Sample}_p(S), M\circ \mathtt{Sample}_p(S')\big)\geqslant pf+(1-p)\mathrm{Id},

which is what we want. The proof is complete. □
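As a small numeric illustration (with hypothetical parameter values), the lower bound of Proposition A.1 can be evaluated directly for the Gaussian mechanism: the subsampled trade-off function is bounded below by $pG_{1/\sigma}+(1-p)\mathrm{Id}$, which is exactly the per-iteration trade-off function that NoisySGD composes.

import numpy as np
from scipy.stats import norm

def G(alpha, mu):
    # Gaussian trade-off function G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu)
    return norm.cdf(norm.ppf(1 - alpha) - mu)

sigma, p = 1.3, 256 / 60000          # hypothetical noise scale and sampling probability
alpha = np.linspace(0, 1, 11)
lower_bound = p * G(alpha, 1 / sigma) + (1 - p) * (1 - alpha)   # p*f + (1-p)*Id
print(np.round(lower_bound, 4))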

Proof of Lemma A.2. Let $P = \sum \theta_i P_i$ and $Q = (1-p)\sum \theta_i P_i +p\sum \theta_i Q_i$. Suppose the rejection rule $\phi$ satisfies $\mathbb{E}_P \phi=\alpha$. That is,

\sum \theta_i \mathbb{E}_{P_i}\phi = \alpha.

It is easy to see that

\mathbb{E}_Q \phi = (1-p)\alpha+p \sum \theta_i \mathbb{E}_{Q_i}\phi.

We know that $T(P_i,Q_i)\geqslant f$. Hence $\mathbb{E}_{Q_i}\phi\leqslant 1-f(\mathbb{E}_{P_i}\phi)$ and so

\sum \theta_i \mathbb{E}_{Q_i}\phi \leqslant 1-\sum \theta_i f(\mathbb{E}_{P_i}\phi).

Since $f$ is convex, Jensen’s inequality implies

\sum \theta_i f(\mathbb{E}_{P_i}\phi)\geqslant f\Big(\sum \theta_i \mathbb{E}_{P_i}\phi\Big)=f(\alpha).

Combining the three displays gives $\mathbb{E}_Q \phi \leqslant (1-p)\alpha + p\big(1-f(\alpha)\big)$, that is, $1-\mathbb{E}_Q \phi \geqslant pf(\alpha)+(1-p)(1-\alpha)$. Since $\phi$ is an arbitrary rejection rule with type I error $\alpha$, this proves $T(P,Q)\geqslant pf+(1-p)\mathrm{Id}$. □

Next we use a figure to justify the claim made in Section 2.2 that the CLT approximation works well for SGD. Recall that we argued in Section 3 that Algorithms 1 and 2 are $\min\{f,f^{-1}\}^{**}$-DP where

f = \big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}.

This function converges to $G_\mu$ with $\mu = \nu\sqrt{\mathrm{e}^{1/\sigma^2}-1}$ as $T\to\infty$, provided $p\sqrt{T}\to\nu$. In the following figure, we numerically compute $f$ (blue dashed) and compare it with the predicted limit $G_\mu$ (red solid). More specifically, the configuration is designed to illustrate the fast convergence in the setting of the second line of Table 1, i.e., noise scale $\sigma =1.1$, final GDP parameter $\mu=0.57$, and test accuracy 96.6%. Originally the algorithm runs 60 epochs, i.e., approximately 14k iterations. To best illustrate that the convergence appears at an early stage, the numerical evaluation uses a much smaller $T_{\textup{numeric}}=234$, i.e., only one epoch. In order to make the final limit consistent, we also enlarge the sampling probability to $p_{\textup{numeric}}$ so that $p_{\textup{numeric}}\cdot \sqrt{T_{\textup{numeric}}}$ remains the same.

Figure A.1. $\big(pG_{1/\sigma} + (1-p)\mathrm{Id}\big)^{\otimes T}$ (blue dashed) is numerically computed and compared with the GDP limit (red solid) predicted by the CLT. The two are almost identical at merely epoch one.

We remark that when $\sigma$ is small, $\mu = \nu\sqrt{\mathrm{e}^{1/\sigma^2}-1}$ becomes large and yields challenges in the numerical computation of $\big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}$. We leave a rigorous and complete study to future work.
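For readers who wish to reproduce a comparison of this flavor without implementing the numerical composition, the following Monte Carlo sketch (with self-chosen, hypothetical parameters) estimates the trade-off function of the $T$-fold composition directly from the Neyman–Pearson likelihood-ratio test and compares it with the $G_\mu$ limit predicted by the CLT.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, T, p = 1.1, 234, 0.028        # hypothetical small-T configuration
mu = p * np.sqrt(T * (np.exp(1 / sigma**2) - 1))

def log_lr(x):
    # log-likelihood ratio of the mixture p*N(1/sigma,1) + (1-p)*N(0,1) against N(0,1)
    return np.log1p(p * np.expm1(x / sigma - 1 / (2 * sigma**2)))

n_mc = 20_000
x_null = rng.standard_normal((n_mc, T))                    # draws from the T-fold product of P
x_alt = rng.standard_normal((n_mc, T))
x_alt += (rng.random((n_mc, T)) < p) / sigma               # each coordinate shifted with probability p
llr_null = log_lr(x_null).sum(axis=1)
llr_alt = log_lr(x_alt).sum(axis=1)

# Neyman-Pearson: threshold the log-likelihood ratio and read off (type I, type II) errors
thresholds = np.quantile(llr_null, np.linspace(0.05, 0.95, 19))
alpha = np.array([(llr_null >= t).mean() for t in thresholds])
beta = np.array([(llr_alt < t).mean() for t in thresholds])
beta_clt = norm.cdf(norm.ppf(1 - alpha) - mu)              # G_mu prediction from the CLT
print(np.abs(beta - beta_clt).max())                       # should be small (Monte Carlo error aside)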

Appendix B. Omitted Details in Section 3

B.1. Privacy Property of Algorithms 1 and 2.

Theorem 3. Algorithms 1 and 2 are both $\min\{f,f^{-1}\}^{**}$-DP with

f = \big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}.

Proof. The proof is mostly done in the main text, except for the composition step. Let $V$ be the vector space in which all the $\theta_t$ live and let $\widetilde{M}=M\circ \mathtt{Sample}_p: X^n\times V\to V$ be the gradient update. We have already proved (using Proposition A.1) that, for both algorithms, if $S'=S\cup\{x_0\}$, then $\widetilde{M}$ satisfies

T\big(\widetilde{M}(S),\widetilde{M}(S')\big)\geqslant f_p:=pG_{1/\sigma}+(1-p)\mathrm{Id}.

Note that we cannot say $\widetilde{M}$ is $f_p$-DP because $T\big(\widetilde{M}(S'),\widetilde{M}(S)\big)$ is not necessarily lower bounded by $f_p$. So we need a more specific composition theorem than the one stated in Dong, Roth, et al. (2019).

Theorem 4 (Refined Composition). Suppose $M_1:X\to Y$ and $M_2:X\times Y \to Z$ satisfy the following conditions for any $S,S'$ such that $S'=S\cup\{x_0\}$:

  1. $T\big(M_1(S),M_1(S')\big)\geqslant f$;

  2. $T\big(M_2(S,y),M_2(S',y)\big)\geqslant g$ for any $y\in Y$.

Then the composition $M_2\circ M_1: X\to Y\times Z$ satisfies

T\big(M_2\circ M_1(S),M_2\circ M_1(S')\big)\geqslant f\otimes g

for any $S,S'$ such that $S'=S\cup\{x_0\}$.

The proof of this theorem is identical to that of Theorem 3.2 in Dong, Roth, et al. (2019).

Taking Algorithm 1 as an example, since

\begin{aligned} \mathrm{\texttt{NoisySGD}}:X^n&\to V\times V\times\cdots\times V\\ S&\mapsto(\theta_1,\theta_2,\ldots, \theta_T) \end{aligned}

is simply the composition of $T$ copies of $\widetilde{M}$, the above composition theorem implies that

T\big(\mathrm{\texttt{NoisySGD}}(S),\mathrm{\texttt{NoisySGD}}(S')\big)\geqslant\big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}=f.

Moreover, $T\big(\mathrm{\texttt{NoisySGD}}(S'),\mathrm{\texttt{NoisySGD}}(S)\big)\geqslant f^{-1}$. The two inequalities allow us to conclude that the trade-off function of any pair of neighboring data sets must be lower bounded by at least one of $f$ and $f^{-1}$, hence by $\min\{f,f^{-1}\}$, and hence by $\min\{f,f^{-1}\}^{**}$. In other words, NoisySGD is $\min\{f,f^{-1}\}^{**}$-DP.

For NoisyAdam, we argued that its privacy property is the same as that of NoisySGD in each iteration, so the above argument also applies and we reach the same conclusion. □

B.2. Justifying CLT for Algorithms 1 and 2.

The main purpose of this section is to show the following theorem.

Theorem 5. Suppose $p$ depends on $T$ and $p\sqrt{T}\to \nu$. Then we have the following uniform convergence as $T\to\infty$:

\big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T} \to G_\mu,

where $\mu = \nu\sqrt{\mathrm{e}^{1/\sigma^2}-1}$.

This theorem is a corollary of the following more general CLT on the composition of subsampled mechanisms, combined with Lemma B.1 below.

Theorem 6. Suppose $f$ is a trade-off function such that (1) $f(0)=1$, (2) $f(x)>0$ for all $x<1$, and (3) $\int_0^1(f'(x)+1)^4\,\mathrm{d}x<+\infty$. Let $f_p = pf+(1-p)\mathrm{Id}$ as usual. Furthermore, assume $p\sqrt{T} \to \nu$ as $T\to\infty$ for some constant $\nu > 0$. Then we have the uniform convergence

f_p^{\otimes T}\to G_{\nu\sqrt{\chi^2(f)}}

as $T \to \infty$, where $\chi^2(f) = \int_0^1 \big(f'(x)\big)^2\,\mathrm{d}x-1$.

Lemma B.1. We have

\chi^2(G_{1/\sigma}) = \mathrm{e}^{1/\sigma^2}-1.

In order to prove Theorem 6, we need an even more general CLT. The first privacy CLT was introduced in Dong, Roth, et al. (2019). However, that version is valid only when each component trade-off function is symmetric, which is not true for $pG_{1/\sigma}+(1-p)\mathrm{Id}$. In order to state the general CLT that applies to asymmetric trade-off functions, we must introduce the following functionals:

\begin{aligned} \mathrm{kl}(f) &:= -\int_0^1\log |f'(x)|\,\mathrm{d}x\\ \tilde{\mathrm{kl}}(f) &:= \int_0^1|f'(x)|\log |f'(x)|\,\mathrm{d}x\\ \kappa_2(f)&:=\int_0^1\log^2 |f'(x)| \,\mathrm{d}x\\ \tilde{\kappa}_2(f)&:=\int_0^1|f'(x)|\log^2 |f'(x)| \,\mathrm{d}x\\ \kappa_3(f)&:=\int_0^1\big|\log^3 |f'(x)|\big|\,\mathrm{d}x\\ \tilde{\kappa}_3(f)&:=\int_0^1|f'(x)|\cdot\big|\log^3 |f'(x)|\big|\,\mathrm{d}x.\end{aligned}

Theorem 7. Let $\{f_{ni}: 1\leqslant i \leqslant n\}_{n=1}^{\infty}$ be a triangular array of (possibly asymmetric) trade-off functions and assume the following limits for some constants $K \geqslant 0$ and $s>0$ as $n \to \infty$:

  1. $\sum_{i=1}^n \mathrm{kl}(f_{ni})+\tilde{\mathrm{kl}}(f_{ni})\to K$;

  2. $\max_{1\leqslant i\leqslant n} \mathrm{kl}(f_{ni}) \to 0,\quad \max_{1\leqslant i\leqslant n} \tilde{\mathrm{kl}}(f_{ni}) \to 0$;

  3. $\sum_{i=1}^n \kappa_2(f_{ni})\to s^2,\quad \sum_{i=1}^n \tilde{\kappa}_2(f_{ni})\to s^2$;

  4. $\sum_{i=1}^n \kappa_3(f_{ni})\to 0,\quad \sum_{i=1}^n \tilde{\kappa}_3(f_{ni})\to 0$.

Then, we have

\lim_{n\to \infty} f_{n1}\otimes f_{n2} \otimes \cdots \otimes f_{nn} (\alpha) = G_{K/s}(\alpha)

uniformly for all $\alpha \in [0,1]$.

The proof of this theorem exactly mimics that of Theorem 3.5 in Dong, Roth, et al. (2019); we omit it here due to its length and tedium.

Next, we apply the asymmetric CLT to $\big(pf+(1-p)\mathrm{Id}\big)^{\otimes T}$ and prove Theorem 6. We start by collecting the necessary expressions into the following lemma. All of the proofs are straightforward.

Lemma B.2. Let $g(x) = -f'(x)-1 = |f'(x)|-1$. Then

\begin{aligned} {\mathrm{kl}}(f_p) &= - \int_0^1\log(1+pg(x))\,\mathrm{d}x\\ \tilde{\mathrm{kl}}(f_p) &=\int_0^1(1+pg(x))\log(1+pg(x))\,\mathrm{d}x\\ \kappa_2(f_p)&= \int_0^{1}\log^2 \big(1+pg(x)\big)\,\mathrm{d}x \\ \tilde{\kappa}_2(f_p)&= \int_0^{1}\big(1+pg(x)\big)\log^2 \big(1+pg(x)\big)\,\mathrm{d}x \\ {\kappa}_3(f_p)&=\int_0^{1}\log^3 \big(1+pg(x)\big)\,\mathrm{d}x\\ \tilde{\kappa}_3(f_p)&=\int_0^{1}\big(1+pg(x)\big)\log^3 \big(1+pg(x)\big)\,\mathrm{d}x. \end{aligned}

Proof of Theorem 6. It suffices to compute the limits required by the asymmetric central limit theorem (Theorem 7), namely

T\cdot \big(\mathrm{kl}(f_p)+\tilde{\mathrm{kl}}(f_p)\big), \,\,T\cdot \kappa_2(f_p),\,\, T\cdot \tilde{\kappa}_2(f_p),\,\, T\cdot {\kappa}_3(f_p)\,\text{ and }\, T\cdot \tilde{\kappa}_3(f_p).

Since $T\sim p^{-2}$, we can equivalently consider $p^{-2}\big(\mathrm{kl}(f_p)+\tilde{\mathrm{kl}}(f_p)\big)$ and so on.

As in Lemma B.2, let $g(x) = -f'(x)-1 = |f'(x)|-1$. Assumption (3), expressed in terms of $g$, is simply

\int_0^1 g^4(x)\,\mathrm{d}x<\infty.

In particular, it implies that $|g(x)|^k$ is integrable over $[0,1]$ for $k=2,3,4$. In addition, since $\int_0^1 f'(x)\,\mathrm{d}x = f(1)-f(0)=-1$,

\chi^2(f) = \int_0^{1}\big(f'(x)\big)^2\,\mathrm{d}x-1= \int_0^{1}\big(f'(x)+1\big)^2\,\mathrm{d}x= \int_0^{1}g^2(x)\,\mathrm{d}x.

For the functional $\mathrm{kl}$, Lemma B.2 gives $\mathrm{kl}(f_p)+\tilde{\mathrm{kl}}(f_p)=\int_0^1 pg(x)\log\big(1+pg(x)\big)\,\mathrm{d}x$, and hence

\begin{aligned} (\text{B}.1) \ \ \ \ \ \lim_{p\to0^+}\frac{1}{p^2}\,\big(\mathrm{kl}(f_p)+\tilde{\mathrm{kl}}(f_p)\big) &= \lim_{p\to0^+}\int_0^{1}g(x)\cdot \frac{1}{p}\log \big(1+pg(x)\big)\,\mathrm{d}x\\ &= \int_0^{1}g(x)\cdot \lim_{p\to0^+}\frac{1}{p}\log \big(1+pg(x)\big)\,\mathrm{d}x\\ &=\int_0^{1}g^2(x)\,\mathrm{d}x=\chi^2(f). \end{aligned}

Changing the order of the limit and the integral in (B.1) is permissible by the dominated convergence theorem. To see this, notice that $\log(1+x)\leqslant x$, so the integrand in (B.1) satisfies

0\leqslant g(x)\cdot \frac{1}{p}\log \big(1+pg(x)\big)\leqslant g^2(x).

We have already argued that $g(x)^2$ is integrable, so it serves as a dominating function and the interchange is justified. When $p\sqrt{T}\to \nu$, we therefore have

T\cdot \big(\mathrm{kl}(f_p)+\tilde{\mathrm{kl}}(f_p)\big)\to \nu^2\cdot\chi^2(f).

So the constant $K$ in Theorem 7 is $\nu^2\cdot\chi^2(f)$.

For the functional $\kappa_2$ we have

\frac{1}{p^2}\kappa_2(f_p) = \int_0^{1}\Big[\frac{1}{p}\log \big(1+pg(x)\big)\Big]^2\,\mathrm{d}x.

By a similar dominating function argument,

\lim_{p\to0^+}\frac{1}{p^2}\,\kappa_2(f_p) = \lim_{p\to0^+}\frac{1}{p^2}\,\tilde{\kappa}_2(f_p) = \int_0^{1}g^2(x)\,\mathrm{d}x=\chi^2(f).

Adding in the limit $p\sqrt{T}\to \nu$, we know that $s^2$ in Theorem 7 is $\nu^2\cdot\chi^2(f)$.

The same argument, involving $\big|g^3(x)\big|$ and $g^4(x)$, applies to the functionals $\kappa_3$ and $\tilde{\kappa}_3$, respectively, and yields

\lim_{p\to0^+}\frac{1}{p^3}\,\kappa_3(f_p) =\lim_{p\to0^+}\frac{1}{p^3}\,\tilde{\kappa}_3(f_p) = \int_0^{1}g^3(x)\,\mathrm{d}x.

Note the different power of $p$ in the denominator: it means $\kappa_3(f_p) = o(p^2)$ and hence $T\cdot \kappa_3(f_p)\to 0$ when $p\sqrt{T}\to \nu$, and likewise for $\tilde{\kappa}_3$.

Hence all the limits in Theorem 7 are verified and we obtain a $G_\mu$ limit where

\mu = K/s = s = \sqrt{\nu^2\cdot\chi^2(f)} = \nu\cdot\sqrt{\chi^2(f)}.

This completes the proof. □
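The key limit (B.1) is easy to check numerically for $f = G_{1/\sigma}$. The sketch below (hypothetical $\sigma$; not part of the original analysis) evaluates $\big(\mathrm{kl}(f_p)+\tilde{\mathrm{kl}}(f_p)\big)/p^2$ for decreasing $p$ and compares it with $\chi^2(G_{1/\sigma}) = \mathrm{e}^{1/\sigma^2}-1$.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.3                                   # hypothetical noise scale
mu = 1 / sigma

def g(x):
    # g(x) = |G_mu'(x)| - 1 with |G_mu'(x)| = exp(mu * Phi^{-1}(1 - x) - mu^2 / 2)
    return np.exp(mu * norm.ppf(1 - x) - mu**2 / 2) - 1

def kl_pair_over_p2(p):
    # (kl(f_p) + kl~(f_p)) / p^2, using kl(f_p) + kl~(f_p) = int_0^1 p*g(x)*log(1 + p*g(x)) dx
    val, _ = quad(lambda x: g(x) * np.log1p(p * g(x)), 0, 1, limit=200)
    return val / p**2

chi2 = np.exp(1 / sigma**2) - 1
for p in [1e-1, 1e-2, 1e-3]:
    print(p, kl_pair_over_p2(p), chi2)        # the middle column should approach chi2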

We finish the section by proving the formula in Lemma B.1.

Proof of Lemma B.1. The cleanest way to carry out this calculation is to recognize what the functional $\chi^2$ really is: it is nothing more than the familiar $\chi^2$-divergence, computed from the trade-off function. Recall that the Neyman $\chi^2$-divergence (reverse Pearson) of $P,Q$ is defined as

\chi^2(P\|Q):= \mathbb{E}_P\big[(\tfrac{\,\mathrm{d}Q}{\,\mathrm{d}P}-1)^2\big].

Lemma B.3. If $f=T(P,Q)$ with $f(0)=1$ and $f(x)>0$ for all $x<1$, then $\chi^2(f) = \chi^2(P\|Q)$.

This lemma is a straightforward corollary of Proposition B.4 in Dong, Roth, et al. (2019), which gives such expressions for all $F$-divergences. In particular, if $f=T(P,Q)$ with $f(0)=1$ and $f(x)>0$ for all $x<1$, then the $F$-divergence of $P,Q$ can be computed from their trade-off function as follows:

D_F(P\|Q) = \int_0^1 F\big({\big|f'(x)\big|}^{-1}\big)\cdot \big|f'(x)\big| \,\mathrm{d}x.

The Neyman $\chi^2$-divergence corresponds to $F(t) = \frac{1}{t}-1$, so

\begin{aligned} \chi^2(P\|Q) &= \int_0^1 \bigg(\frac{1}{\big|f'(x)\big|^{-1}}-1\bigg)\cdot \big|f'(x)\big| \,\mathrm{d}x\\ &=\int_0^1 \big(f'(x)\big)^2 \,\mathrm{d}x- \int_0^1 \big|f'(x)\big| \,\mathrm{d}x\\ &=\int_0^1 \big(f'(x)\big)^2 \,\mathrm{d}x- 1.\end{aligned}

With this formula, computing $\chi^2(G_{1/\sigma})$ is straightforward:

\chi^2(G_{1/\sigma}) = \chi^2\big(\mathcal{N}(\tfrac{1}{\sigma},1)\|\mathcal{N}(0,1)\big) = \mathrm{e}^{1/\sigma^2}-1. \qquad \square
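As a quick numerical sanity check (with a hypothetical $\sigma$), the quantity $\int_0^1 (f'(x))^2\,\mathrm{d}x - 1$ can be evaluated directly for $f = G_{1/\sigma}$ and compared with $\mathrm{e}^{1/\sigma^2}-1$:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.3                                                     # hypothetical noise scale
mu = 1 / sigma
abs_deriv = lambda x: np.exp(mu * norm.ppf(1 - x) - mu**2 / 2)  # |G_mu'(x)|
integral, _ = quad(lambda x: abs_deriv(x)**2, 0, 1, limit=200)
print(integral - 1, np.exp(1 / sigma**2) - 1)                   # both approximate chi^2(G_{1/sigma})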

B.3. Proof of Theorems 1 and 2.

Recall that Theorems 1 and 2 compare our CLT approach with the moments accountant ($\mathtt{MA}$) from two different perspectives: the $f$-DP perspective in Theorem 1 and the $(\varepsilon,\delta)$-DP perspective in Theorem 2. We first show that Theorem 1 can be derived from Theorem 2. Then we prove a refined version of Theorem 2. To be more precise about the statements, let us first expand the notation used in the main text.

Let $\delta_{\mathtt{MA}}(\varepsilon;\sigma,p,T)$ be the $\delta$ value computed by the moments accountant method (described in detail below) for the NoisySGD algorithm with subsampling probability $p$, $T$ iterations, and noise scale $\sigma$. Similarly, $\delta_{\mathtt{CLT}}(\varepsilon;\sigma,\nu)$ denotes the $\delta$ value computed for the same algorithm using the central limit theorem, assuming $p\sqrt{T}\to \nu$.

Let $f_T(\alpha)=\sup_{\varepsilon\geqslant 0} f_{\varepsilon, \delta_{\mathtt{MA}}(\varepsilon)}(\alpha)$. It is supported by $f_{\varepsilon_T, \delta_{\mathtt{MA}}(\varepsilon_T)}$ at $\alpha$. Theorem 2 says this supporting function is smaller than that of $G_{\mu_{\mathtt{CLT}}}$ at $\alpha$ by a strict gap. Taking the limit, $\limsup_{T \to \infty} f_T(\alpha)$ retains at least that much gap below $G_{\mu_{\mathtt{CLT}}}(\alpha)$, which proves Theorem 1.

Theorem 2 is a straightforward corollary of the following proposition. Note that the inequality is reversed compared to the statement of Theorem 2 so that the gap is positive, which also turns the $\limsup$ into a $\liminf$.

Proposition B.4.

\liminf_{T\to\infty} \,\,\delta_{\mathtt{MA}}(\varepsilon;\sigma,\tfrac{\nu}{\sqrt{T}},T) - \delta_{\mathtt{CLT}}(\varepsilon;\sigma,\nu) \geqslant \mathrm{e}^{\varepsilon}\cdot\Phi\Big(-\frac{\varepsilon}{\mu}-\frac{\mu}{2}\Big),

where $\mu=\nu\cdot \sqrt{\mathrm{e}^{1/\sigma^2}-1}$.

Let us first describe how the two methods compute $\delta$ from $\varepsilon$:

\begin{aligned} \delta_{\mathtt{nMA}}(\varepsilon;\sigma,p,T) &:= \inf_{\lambda\in \textup{domain}} \exp\left( T\cdot\alpha_{\textup{GM}}(\lambda;\sigma,p) - \lambda \varepsilon\right),\\ \delta_{\mathtt{MA}}(\varepsilon;\sigma,p,T) &:= \inf_{\lambda>0} \exp\left( T\cdot\alpha_{\textup{GM}}(\lambda;\sigma,p) - \lambda \varepsilon\right),\end{aligned}

where $\alpha_{\textup{GM}}(\lambda;\sigma,p)$ is a scaled version of the Rényi divergence of Gaussian mixtures. More specifically, let $P = \mathcal{N}(0,1)$ and $Q=\mathcal{N}(\frac{1}{\sigma},1)$. We further denote the Gaussian mixture $p Q + (1-p) P$ by $\mathrm{GM}_{p,\sigma}$. Then

\alpha_{\textup{GM}}(\lambda;\sigma,p) =\max\Big\{\lambda D_{\lambda+1}(\mathrm{GM}_{p,\sigma}\|P), \lambda D_{\lambda+1}(P\|\mathrm{GM}_{p,\sigma})\Big\}.

In Abadi et al. (2016), it was shown that Algorithm 1 (and hence also the Adam variant, Algorithm 2) with subsampling probability $p$, $T$ iterations, and noise scale $\sigma$ is $(\varepsilon,\delta)$-DP for each $\varepsilon\geqslant0$ if $\delta=\delta_{\mathtt{MA}}(\varepsilon;\sigma,p,T)$. To evaluate the infimum, the domain of $\lambda$ is discretized, which results in the numerical moments accountant $\delta_{\mathtt{nMA}}$ that is actually implemented. Since $\delta_{\mathtt{nMA}}(\varepsilon;\sigma,p,T)\geqslant \delta_{\mathtt{MA}}(\varepsilon;\sigma,p,T)$, Algorithm 1 is also $(\varepsilon,\delta)$-DP with $\delta=\delta_{\mathtt{nMA}}(\varepsilon;\sigma,p,T)$.
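For concreteness, here is a rough numerical sketch of this accountant (a hypothetical helper, not the TensorFlow Privacy implementation, which evaluates these moments with more care): the two Rényi divergences of $\mathrm{GM}_{p,\sigma}$ against $P$ are computed by integrating on a grid in log space, and the infimum is taken over an integer grid of $\lambda$.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_moment(power, sigma, p, grid=np.linspace(-20, 120, 200001)):
    # log E_P[(dGM/dP)^power], with P = N(0,1) and GM = p*N(1/sigma,1) + (1-p)*P
    log_ratio = np.log1p(p * np.expm1(grid / sigma - 1 / (2 * sigma**2)))
    log_vals = power * log_ratio + norm.logpdf(grid)
    return logsumexp(log_vals) + np.log(grid[1] - grid[0])

def alpha_gm(lam, sigma, p):
    # max of lambda*D_{lambda+1}(GM || P) and lambda*D_{lambda+1}(P || GM)
    return max(log_moment(lam + 1, sigma, p), log_moment(-lam, sigma, p))

def delta_ma(eps, sigma, p, T, lam_max=64):
    # discretized infimum over integer lambda, mirroring the numerical moments accountant
    return min(np.exp(min(T * alpha_gm(lam, sigma, p) - lam * eps, 0.0))
               for lam in range(1, lam_max + 1))

print(delta_ma(eps=1.34, sigma=1.3, p=256 / 60000, T=4700))    # hypothetical 20-epoch MNIST-like run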

On the other hand, $\delta_{\mathtt{CLT}}(\varepsilon;\sigma,\nu)$ is obtained by first observing that Algorithm 1 is asymptotically $\mu_{\mathtt{CLT}}$-GDP with $\mu_{\mathtt{CLT}} = \nu\cdot \sqrt{\mathrm{e}^{1/\sigma^2}-1}$ and then converting GDP to $(\varepsilon,\delta)$-DP via formula (3.2); that is, Algorithm 1 asymptotically satisfies $(\varepsilon,\delta)$-DP where

\delta = \delta_{\mathtt{CLT}}(\varepsilon;\sigma,\nu) =1 + G^*_{\mu_{\mathtt{CLT}}}(-\mathrm{e}^{\varepsilon}) = \Phi\Big( -\frac{\varepsilon}{\mu_{\mathtt{CLT}}} +\frac{\mu_{\mathtt{CLT}}}{2}\Big)- \mathrm{e}^{\varepsilon}\Phi\Big(- \frac{\varepsilon}{\mu_{\mathtt{CLT}}} - \frac{\mu_{\mathtt{CLT}}}{2}\Big).

We have just explained how $\mathtt{MA}$ and $\mathtt{CLT}$ work. Next we prove Proposition B.4.

Proof of Proposition B.4. Let $f_{T}=\big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T}$. We need a lemma (whose proof is provided later) that relates the Rényi divergence to the trade-off function $f_{T}$.

Lemma B.5.

T\cdot\alpha_{\textup{GM}}(\lambda;\sigma,p)\geqslant\log \int_0^1 |f_T'(x)|^{\lambda + 1} \,\mathrm{d}x.

Let $x_T\in(0,1)$ be the point at which $f_T'(x_T) = - \mathrm{e}^\varepsilon$ (or $x_T\in\partial f_T^*(- \mathrm{e}^\varepsilon)$ if readers worry about differentiability). We have

1 + f_T^*(-\mathrm{e}^{\varepsilon}) = \sup_{0 \leqslant x \leqslant 1} \{1 - f_T(x) - \mathrm{e}^{\varepsilon} x \} = 1 - f_T(x_T) - \mathrm{e}^{\varepsilon} x_T.

It is clear that $|f_T'(x)| \geqslant\mathrm{e}^\varepsilon$ for $0 \leqslant x \leqslant x_T$.

On the other hand, using Lemma B.5, we obtain

\begin{aligned} \delta_\mathtt{MA}(\varepsilon;\sigma,p,T) &= \inf_{\lambda>0} \exp\left( T\cdot\alpha_{\textup{GM}}(\lambda;\sigma,p) - \lambda \varepsilon\right)\\ & \geqslant \inf_{\lambda>0} \mathrm{e}^{-\lambda \varepsilon}\int_0^1 |f_T'(x)|^{\lambda + 1} \,\mathrm{d}x\\ & > \inf_{\lambda>0} \int_0^{x_T} |f_T'(x)|^{\lambda + 1} \mathrm{e}^{-\lambda \varepsilon} \,\mathrm{d}x\\ & = \inf_{\lambda>0} \int_0^{x_T} |f'_T(x)| \cdot |f'_T(x)|^{\lambda} \mathrm{e}^{-\lambda \varepsilon} \,\mathrm{d}x\\ & \geqslant\inf_{\lambda>0} \int_0^{x_T} |f'_T(x)| \cdot (\mathrm{e}^{\varepsilon})^{\lambda} \mathrm{e}^{-\lambda \varepsilon} \,\mathrm{d}x\\ &= f_T(0) - f_T(x_T)\\ &= 1 - f_T(x_T)\\ &= \big(1 - f_T(x_T) - \mathrm{e}^{\varepsilon} x_T\big) + \mathrm{e}^{\varepsilon} x_T\\ &= 1 + f_T^*(-\mathrm{e}^{\varepsilon})+\mathrm{e}^{\varepsilon} x_T.\end{aligned}

In summary, we have

(\text{B}.2) \qquad\qquad\qquad \delta_\mathtt{MA}(\varepsilon;\sigma,p,T) > 1 + f_T^*(-\mathrm{e}^{\varepsilon})+\mathrm{e}^{\varepsilon} x_T.

Setting $p=\frac{\nu}{\sqrt{T}}$, we would like to take limits on both sides of (B.2). First notice that $f_T$ converges pointwise to $G_{\mu_{\mathtt{CLT}}}$, which we have already proven in Appendix B.2. The limit of $x_T$ is evaluated in the following lemma:

Lemma B.6.

\lim_{T\to\infty} x_T = x^* := \Phi\Big(-\frac{\varepsilon}{\mu_{\mathtt{CLT}}}-\frac{\mu_{\mathtt{CLT}}}{2}\Big).

Combining these results, we can take limits on both sides of (B.2):

\begin{aligned} \liminf_{T\to\infty} \delta_\mathtt{MA}(\varepsilon;\sigma,p,T) &\geqslant \lim_{T\to\infty} 1 + f_T^*(-\mathrm{e}^{\varepsilon})+\mathrm{e}^{\varepsilon} x_T\\ &= 1+G_{\mu_{\mathtt{CLT}}}^*(-\mathrm{e}^{\varepsilon})+\mathrm{e}^{\varepsilon} x^*\\ &=\delta_{\mathtt{CLT}}(\varepsilon;\sigma,\nu)+\mathrm{e}^{\varepsilon}\cdot \Phi\Big(-\frac{\varepsilon}{\mu_{\mathtt{CLT}}}-\frac{\mu_{\mathtt{CLT}}}{2}\Big).\end{aligned}

This finishes the proof. □

Proof of Lemma B.5. The Rényi divergence can also be computed from the trade-off function, just like the $\chi^2$-divergence. In fact, under the same assumptions as in Lemma B.3, we have

D_\alpha(Q\|P) = \frac{1}{\alpha-1}\log \int_0^1|f'(x)|^{\alpha}\,\mathrm{d}x.

Alternatively,

(\text{B}.3) \qquad\qquad\qquad \lambda D_{\lambda+1}(Q\|P) =\log \int_0^1|f'(x)|^{\lambda+1}\,\mathrm{d}x.

This identity will be the bridge between $\alpha_{\textup{GM}}$ and $f_T$.

On the one hand, $\alpha_{\textup{GM}}(\lambda;\sigma,p)$ is the maximum of two Rényi divergences, so

\alpha_{\textup{GM}}(\lambda;\sigma,p) \geqslant\lambda D_{\lambda+1}(p Q + (1-p) P\|P).

Consequently,

\begin{aligned} T\cdot\alpha_{\textup{GM}}(\lambda;\sigma,p) &\geqslant T\lambda D_{\lambda+1}(p Q + (1-p) P\|P)\\ &=\lambda D_{\lambda+1}\big((p Q + (1-p) P)^T\|P^T\big).\end{aligned}

The last step is the tensorization identity of the Rényi divergence.

On the other hand, notice that $pG_{1/\sigma}+(1-p)\mathrm{Id}=T\big(P, \mathrm{GM}_{p,\sigma}\big)$, where we continue to use the notation $P = \mathcal{N}(0,1)$, $Q=\mathcal{N}(\frac{1}{\sigma},1)$, and $\mathrm{GM}_{p,\sigma} = p Q + (1-p) P$. We have

f_T =\big(pG_{1/\sigma}+(1-p)\mathrm{Id}\big)^{\otimes T} =T\big(P, pQ + (1-p)P\big)^{\otimes T} =T\big(P^T, (pQ + (1-p)P)^T\big).

Using (B.3), we have

\begin{aligned} T\cdot\alpha_{\textup{GM}}(\lambda;\sigma,p) &\geqslant\lambda D_{\lambda+1}\big((p Q + (1-p) P)^T\|P^T\big)\\ &=\log \int_0^1|f_T'(x)|^{\lambda+1}\,\mathrm{d}x.\end{aligned}

□
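As a sanity check of identity (B.3) in the simplest case (with hypothetical values of $\mu$ and $\lambda$): for $f = G_\mu = T\big(\mathcal{N}(0,1), \mathcal{N}(\mu,1)\big)$, the right-hand side of (B.3) should equal $\lambda D_{\lambda+1}\big(\mathcal{N}(\mu,1)\|\mathcal{N}(0,1)\big) = \lambda(\lambda+1)\mu^2/2$.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, lam = 0.8, 3                                                # hypothetical values
abs_deriv = lambda x: np.exp(mu * norm.ppf(1 - x) - mu**2 / 2)  # |G_mu'(x)|
rhs, _ = quad(lambda x: abs_deriv(x)**(lam + 1), 0, 1, points=[1e-6, 1e-3, 1e-1], limit=200)
print(np.log(rhs), lam * (lam + 1) * mu**2 / 2)                 # should agree up to numerical error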

Proof of Lemma B.6. By definition, $f_T'(x_T) = -\mathrm{e}^\varepsilon$. The convexity of $f_T$ implies $\nabla f_T^*(-\mathrm{e}^\varepsilon) = x_T$.

Since $f_T$ converges uniformly to $G_{\mu_{\mathtt{CLT}}}$ over $[0,1]$, we have the uniform convergence $f_T^*\to G_{\mu_{\mathtt{CLT}}}^*$. By convexity of these functions, the convergence also implies the convergence of derivatives (see Theorem 25.7 of Rockafellar, 1970), namely,

\nabla f_T^*\to \nabla G_{\mu_{\mathtt{CLT}}}^*.

Therefore,

x_T = \nabla f_T^*(-\mathrm{e}^\varepsilon)\to \nabla G_{\mu_{\mathtt{CLT}}}^*(-\mathrm{e}^\varepsilon).

Let $x^*=\nabla G_{\mu_{\mathtt{CLT}}}^*(-\mathrm{e}^\varepsilon)$ be the limit. Using the convexity again, we have

-\mathrm{e}^\varepsilon= G_{\mu_{\mathtt{CLT}}}'(x^*).

We can solve for $x^*$ using the expression (3.2) of $G_\mu$. After some algebra, we have

x^* = \Phi\Big(-\frac{\varepsilon}{\mu_{\mathtt{CLT}}}-\frac{\mu_{\mathtt{CLT}}}{2}\Big).

The proof is complete. □
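To double-check the ‘after some algebra’ step with a quick numeric experiment (hypothetical $\mu_{\mathtt{CLT}}$ and $\varepsilon$), one can solve $G_\mu'(x^*) = -\mathrm{e}^{\varepsilon}$ by root finding and compare the result with the closed form above.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

mu, eps = 1.13, 3.0                                            # hypothetical GDP parameter and epsilon
G = lambda x: norm.cdf(norm.ppf(1 - x) - mu)                   # G_mu
dG = lambda x, h=1e-7: (G(x + h) - G(x - h)) / (2 * h)         # numerical derivative of G_mu
x_star = brentq(lambda x: dG(x) + np.exp(eps), 1e-6, 1 - 1e-6)
print(x_star, norm.cdf(-eps / mu - mu / 2))                    # the two should agree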
