Deep Learning with Gaussian Differential Privacy

Deep learning models are often trained on datasets that contain sensitive information such as individuals' shopping transactions, personal contacts, and medical records. An increasingly important line of work therefore has sought to train neural networks subject to privacy constraints that are specified by differential privacy or its divergence-based relaxations. These privacy definitions, however, have weaknesses in handling certain important primitives (composition and subsampling), thereby giving loose or complicated privacy analyses of training neural networks. In this paper, we consider a recently proposed privacy definition termed f-differential privacy [17] for a refined privacy analysis of training neural networks. Leveraging the appealing properties of f-differential privacy in handling composition and subsampling, this paper derives analytically tractable expressions for the privacy guarantees of both stochastic gradient descent and Adam used in training deep neural networks, without the need of developing sophisticated techniques as [3] did. Our results demonstrate that the f-differential privacy framework allows for a new privacy analysis that improves on the prior analysis [3], which in turn suggests tuning certain parameters of neural networks for a better prediction accuracy without violating the privacy budget. These theoretically derived improvements are confirmed by our experiments in a range of tasks in image classification, text classification, and recommender systems.


Introduction
In many applications of machine learning, the datasets contain sensitive information about individuals such as location, personal contacts, media consumption, and medical records. Exploiting the output of the machine learning algorithm, an adversary may be able to identify some individuals in the dataset, thus presenting serious privacy concerns. This reality gave rise to a broad and pressing call for developing privacy-preserving data analysis methodologies. Accordingly, there have been numerous investigations in the scholarly literature of many fields (statistics, cryptography, machine learning, and law) for the protection of privacy in data analysis.
Along this line, research efforts have repeatedly suggested the necessity of a rigorous and versatile definition of privacy. Among other things, researchers have questioned whether the use of a privacy definition gives interpretable privacy guarantees, and if so, whether this privacy definition allows for high accuracy of the private model among alternative definitions. In particular, anonymization as a syntactic and ad-hoc privacy concept has been shown to generally fail to guarantee privacy. Examples include the identification of a homosexual individual in the anonymized Netflix Challenge dataset [43] and the identification of the health records of the then Massachusetts governor in public anonymized medical datasets [54].
In this context, (ε, δ)-differential privacy (DP) arose as a mathematically rigorous definition of privacy [22]. Today, this definition has developed into a firm foundation of private data analysis, with its applications deployed by Google [25], Apple [5], Microsoft [15], and the US Census Bureau [4]. Despite its impressive popularity in both the scholarly literature and the industry, (ε, δ)-DP is not versatile enough to handle composition, which is perhaps the most fundamental primitive in statistical privacy. For example, the training process of deep neural networks is in effect the composition of many primitive building blocks known as stochastic gradient descent (SGD). Under a modest privacy budget in the (ε, δ)-DP sense, however, it was not clear how to maintain a high prediction accuracy of deep learning. This requires a tight privacy analysis of composition in the (ε, δ)-DP framework. Indeed, the analysis of the privacy costs in deep learning was refined only recently using a sophisticated technique called the moments accountant [3].
Ideally, we hope to have a privacy definition that allows for refined privacy analyses of various algorithms in a principled manner, without resorting to sophisticated techniques. Having a refined privacy analysis not only enhances the trustworthiness of the models but can also be leveraged to improve the prediction accuracy by trading off privacy for utility. One possible candidate is f-differential privacy (f-DP), a relaxation of (ε, δ)-DP that was recently proposed by Dong, Roth, and Su [17]. This new privacy definition faithfully retains the hypothesis testing interpretation of differential privacy and can losslessly reason about common primitives associated with differential privacy, including composition, privacy amplification by subsampling, and group privacy. In addition, f-DP includes a canonical single-parameter family that is referred to as Gaussian differential privacy (GDP). Notably, GDP is the focal privacy definition due to a central limit theorem that states that the privacy guarantees of the composition of private algorithms are approximately equivalent to telling apart two shifted normal distributions.
The main results of this paper show that f-DP offers a rigorous and versatile framework for developing private deep learning methodologies. Our guarantee provides protection against an attacker with knowledge of the network architecture as well as the model parameters, which is in the same spirit as [51,3]. In short, this paper delivers the following messages concerning f-DP:
Closed-form privacy bounds. In the f-DP framework, the overall privacy loss incurred in training neural networks admits an amenable closed-form expression. In contrast, the privacy analysis via the moments accountant must be done by numerical computation [3], and the implicit nature of this earlier approach can hinder our understanding of how the tuning parameters affect the privacy bound. This is discussed in Section 3.1.
Stronger privacy guarantees. The f -DP approach gives stronger privacy guarantees than the earlier approach [3], even in terms of (ε, δ)-DP. This improvement is due to the use of the central limit theorem for f -DP, which accurately captures the privacy loss incurred at each iteration in training the deep learning models. This is presented in Section 3.2 and illustrated with numerical experiments in Section 4.1.
Improved prediction accuracy. Leveraging the stronger privacy guarantees provided by f -DP, we can trade a certain amount of privacy for an improvement in prediction performance. This can be realized, for example, by appropriately reducing the amount of noise added during the training process of neural networks so as to match the target privacy level in terms of (ε, δ)-DP. See Section 3.2 and Section 4.2 for the development of this utility improvement.
The remainder of the paper is structured as follows. In Section 1.1 we provide a brief review of related literature. Section 2 introduces f -DP and its basic properties at a minimal level. Next, in Section 3 we analyze the privacy cost of training deep neural networks in terms of f -DP and compare it to the privacy analysis using the moments accountant. In Section 4, we present numerical experiments to showcase the superiority of the f -DP approach to private deep learning in terms of test accuracy and privacy guarantees. The paper concludes with a discussion in Section 5.

Related Work
There are continued efforts to understand how privacy degrades under composition. Developments along this line include the basic composition theorem and the advanced composition theorem [21,23]. In a pioneering work, [31] obtained an optimal composition theorem for (ε, δ)-DP, which in fact served as one of the motivations for the f -DP work [17]. However, it is #P hard to compute the privacy bounds from their composition theorem [42]. More recently, [16] derived sharp composition bounds on the overall privacy loss for exponential mechanisms.
From a different angle, a substantial recent effort has been devoted to relaxing differential privacy using divergences of probability distributions to overcome the weakness of (ε, δ)-DP in handling composition [20,10,41,11]. Unfortunately, these relaxations either lack a privacy amplification by subsampling argument or present a quite complex argument that is difficult to use [7,55]. Because subsampling is inherently used in training neural networks, it is difficult to directly apply these relaxations to the privacy analysis of deep learning.
To circumvent these technical difficulties associated with (ε, δ)-DP and its divergence-based relaxations, Abadi et al. [3] invented a technique termed the moments accountant to track detailed information of the privacy loss in the training process of deep neural networks. Using the moments accountant, their analysis significantly improves on earlier privacy analysis of SGD [13,53,9,51,58] and allows for meaningful privacy guarantees for deep learning trained on realistically sized datasets. This technique has been extended to a variety of situations by follow-up work [39,46]. In contrast, our approach to private deep learning in the f -DP framework leverages some powerful tools of this new privacy definition, nevertheless providing a sharper privacy analysis, as seen both theoretically and empirically in Sections 3 and 4.
For completeness, we remark that different approaches have been put forward to incorporate privacy considerations into deep learning, without leveraging the iterative and subsampling natures of training deep learning models. This line of work includes training a private model by an ensemble of "teacher" models [45,44], the development of noised federated averaging algorithms [40], and analyzing privacy costs through the lens of the optimization landscape of neural networks [57].

f -Differential Privacy
In the differential privacy framework, we envision an adversary that is well-informed about the dataset except for a single individual, and the adversary seeks to determine whether this individual is in the dataset on the basis of the output of an algorithm. Roughly speaking, the algorithm is considered private if the adversary finds it hard to determine the presence or absence of any individual.
Informally, a dataset can be thought of as a matrix whose rows each contain one individual's data. Two datasets are said to be neighbors if one can be derived by discarding an individual from the other. As such, the sizes of neighboring datasets differ by one. (Alternatively, the neighboring relationship can be defined for datasets of the same size and differing by one individual.) Let S and S′ be neighboring datasets, let ε ≥ 0 and 0 ≤ δ ≤ 1 be two numbers, and denote by M a (randomized) algorithm that takes as input a dataset.
Definition 2.1 ([22,21]). A (randomized) algorithm M gives (ε, δ)-differential privacy if for any pair of neighboring datasets S, S′ and any event E,
\[ \mathbb{P}(M(S) \in E) \le e^{\varepsilon}\, \mathbb{P}(M(S') \in E) + \delta. \]
To achieve privacy, the algorithm M is necessarily randomized, whereas the two datasets in Definition 2.1 are deterministic. This privacy definition ensures that, based on the output of the algorithm, the adversary has a limited (depending on how small ε, δ are) ability to identify the presence or absence of any individual, regardless of whether any individual opts in to or opts out of the dataset.
In essence, the adversary seeks to tell apart the two probability distributions M(S) and M(S′) using a single draw. In light of this observation, it is natural to interpret what the adversary does as testing two simple hypotheses: H_0: the true dataset is S versus H_1: the true dataset is S′. The connection between differential privacy and hypothesis testing was, to our knowledge, first noted in [56], and was later developed in [31,37,8]. Intuitively, privacy is well guaranteed if the hypothesis testing problem is hard. Following this intuition, the definition of (ε, δ)-DP essentially uses the worst-case likelihood ratio of the distributions M(S) and M(S′) to measure the hardness of testing the two simple hypotheses.
Is there a more informative measure of the hardness? In [17], the authors propose to use the trade-off between type I error and type II error in place of a few privacy parameters in (ε, δ)-DP or divergence-based DP definitions. To formally define this new privacy definition, let P and Q denote the distributions of M(S) and M(S′), respectively, and let φ be any (possibly randomized) rejection rule for testing H_0: P against H_1: Q. With these in place, [17] defines the trade-off function of P and Q as
\[ T(P, Q)(\alpha) = \inf_{\phi} \bigl\{ 1 - \mathbb{E}_Q[\phi] : \mathbb{E}_P[\phi] \le \alpha \bigr\}. \]
Above, E_P[φ] and 1 − E_Q[φ] are the type I and type II errors of the rejection rule φ, respectively. Writing f = T(P, Q), the definition says that f(α) is the minimum type II error among all tests at significance level α. Note that the minimum can be achieved by taking the likelihood ratio test, according to the Neyman-Pearson lemma. As is self-evident, the larger the trade-off function is, the more difficult the hypothesis testing problem is (hence more privacy). This motivates the following privacy definition.
Definition 2.2 ([17]). A (randomized) algorithm M is f-differentially private if
\[ T(M(S), M(S')) \ge f \]
for all neighboring datasets S and S′.
In this definition, the inequality holds pointwise for all 0 ≤ α ≤ 1, and we abuse notation by identifying M(S) and M(S′) with their associated distributions. This privacy definition is easily interpretable due to its inherent connection with the hypothesis testing problem. By adapting a result due to Wasserman and Zhou [56], (ε, δ)-DP is a special instance of f-DP in the sense that an algorithm is (ε, δ)-DP if and only if it is f_{ε,δ}-DP with
\[ f_{\varepsilon,\delta}(\alpha) = \max\bigl\{ 0,\; 1 - \delta - e^{\varepsilon}\alpha,\; e^{-\varepsilon}(1 - \delta - \alpha) \bigr\}. \tag{1} \]
The more intimate relationship between the two privacy definitions is that they are dual to each other: briefly speaking, f-DP ensures (ε, δ(ε))-DP with δ(ε) = 1 + f*(−e^ε) for every ε ≥ 0, where f* denotes the convex conjugate of f. Next, we define a single-parameter family of privacy definitions within the f-DP class for a reason that will be apparent later. Let G_µ := T(N(0, 1), N(µ, 1)) for µ ≥ 0. Note that this trade-off function admits a closed-form expression
\[ G_\mu(\alpha) = \Phi\bigl( \Phi^{-1}(1 - \alpha) - \mu \bigr), \]
where Φ is the cumulative distribution function of the standard normal distribution.
Definition 2.3 ([17]). A (randomized) algorithm M gives µ-Gaussian differential privacy (GDP) if M is G_µ-DP, that is, if T(M(S), M(S′)) ≥ G_µ for all neighboring datasets S and S′.
In words, µ-GDP says that determining whether any individual is in the dataset is at least as difficult as telling apart the two normal distributions N(0, 1) and N(µ, 1) based on one draw. The Gaussian mechanism serves as a template to achieve GDP. Consider the problem of privately releasing a univariate statistic θ(S). The Gaussian mechanism adds N(0, σ²) noise to the statistic θ, which gives µ-GDP if σ = sens(θ)/µ.
Here the sensitivity of θ is defined as sens(θ) = sup_{S,S′} |θ(S) − θ(S′)|, where the supremum is over all neighboring datasets.
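As a minimal sketch of the two objects just introduced, the snippet below evaluates the Gaussian trade-off function, assuming the closed form G_µ(α) = Φ(Φ^{-1}(1 − α) − µ) from [17], and calibrates the Gaussian mechanism with σ = sens(θ)/µ. The function names are illustrative choices and do not come from the paper or its repository.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), valid for 0 < alpha < 1."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def gaussian_mechanism(statistic: float, sensitivity: float, mu: float,
                       rng: random.Random) -> float:
    """Release statistic + N(0, sigma^2) noise with sigma = sensitivity / mu."""
    sigma = sensitivity / mu
    return statistic + rng.gauss(0.0, sigma)
```

With µ = 0 the function reduces to Id(α) = 1 − α (perfect privacy), and larger µ pushes the curve toward zero, that is, toward an easier testing problem for the adversary.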

Properties of f -Differential Privacy
Composition. Deep learning models are trained using the composition of many SGD updates. Broadly speaking, composition is concerned with a sequence of analyses on the same dataset where each analysis is informed by the explorations of prior analyses. A central question that every privacy definition is faced with is to pinpoint how the overall privacy guarantee degrades under composition. Formally, letting M_1 be the first algorithm and M_2 be the second, we define their composition algorithm M as M(S) = (M_1(S), M_2(S, M_1(S))). Roughly speaking, the composition is to "release all information that is learned by the algorithms." Notably, the second algorithm M_2 can take as input the output of M_1 in addition to the dataset S. In general, the composition of more than two algorithms follows recursively.
To introduce the composition theorem for f-DP, [17] defines a binary operation ⊗ on trade-off functions. Given trade-off functions f = T(P, Q) and g = T(P′, Q′), let f ⊗ g = T(P × P′, Q × Q′). This definition depends on the distributions P, Q, P′, Q′ only through f and g. Moreover, ⊗ is commutative and associative. Now the composition theorem can be readily stated as follows. Let M_t be f_t-DP conditionally on any output of the prior algorithms for t = 1, ..., T. Then their T-fold composition algorithm is f_1 ⊗ ··· ⊗ f_T-DP. This result shows that the composition of algorithms in the f-DP framework is reduced to performing the ⊗ operation on the associated trade-off functions. As an important fact, the privacy bound f_1 ⊗ ··· ⊗ f_T in general cannot be improved.
More profoundly, a central limit theorem phenomenon arises in the composition of many "very private" f-DP algorithms in the following sense: the trade-off functions of small privacy leakage accumulate to G_µ for some µ under composition. Informally, assuming each f_t is very close to Id(α) = 1 − α, which corresponds to perfect privacy, then we have
\[ f_1 \otimes f_2 \otimes \cdots \otimes f_T \approx G_\mu \tag{2} \]
if T is sufficiently large. The privacy parameter µ depends on some functionals such as the Kullback-Leibler divergence of the trade-off functions. The central limit theorem yields a very accurate approximation in the settings considered in Section 4 (see numerical confirmation in Appendix A). For a rigorous account of this central limit theorem for differential privacy, see Theorem 3.5 in [17]. We remark that a conceptually related article [52] developed a central limit theorem for privacy loss random variables. At a high level, this convergence-to-GDP result brings GDP to the focal point of the family of f-DP guarantees, implying that GDP is to f-DP as normal random variables are to general random variables. Furthermore, this result serves as an effective tool for approximating the privacy guarantees of composition algorithms. In contrast, privacy loss cannot be losslessly tracked under composition in the (ε, δ)-DP framework.
Subsampling. In training neural networks, the gradient at each iteration is computed from a mini-batch that is subsampled from the training examples. Intuitively, an algorithm applied to a subsample gives stronger privacy guarantees than applied to the full sample. Looking closely, this privacy amplification is due to the fact that an individual enjoys perfect privacy if not selected in the subsample. A concrete and pressing question is, therefore, to precisely characterize how much privacy is amplified by subsampling in the f -DP framework.
Consider the following sampling scheme: for each individual in the dataset S, include his or her datum in the subsample independently with probability p, which is sometimes referred to as Poisson subsampling [55]. The resulting subsample is denoted by Sample_p(S). To avoid any confusion, we remark that the subsample Sample_p(S) has a random size and, as an intermediate step, is not released. Given any algorithm M, denote by M ∘ Sample_p the subsampled algorithm.
The subsampling theorem for f-DP is stated as follows. Let M be f-DP, write f_p for pf + (1 − p)Id, and denote by f_p^{-1} the inverse of f_p, defined by f_p^{-1}(α) = inf{0 ≤ t ≤ 1 : f_p(t) ≤ α}. It is proved in Appendix A that the subsampled algorithm M ∘ Sample_p satisfies
\[ T\bigl( M \circ \mathrm{Sample}_p(S),\; M \circ \mathrm{Sample}_p(S') \bigr) \ge f_p \]
if S can be obtained by removing one individual from S′. Likewise,
\[ T\bigl( M \circ \mathrm{Sample}_p(S'),\; M \circ \mathrm{Sample}_p(S) \bigr) \ge f_p^{-1}. \]
As such, the two displays above say that the trade-off function of M ∘ Sample_p on any neighboring datasets is lower bounded by min{f_p, f_p^{-1}}, which however is in general non-convex and thus is not a trade-off function. This suggests that we can boost the privacy bound by replacing min{f_p, f_p^{-1}} with its double conjugate min{f_p, f_p^{-1}}**, which is the greatest convex lower bound of min{f_p, f_p^{-1}} and is indeed a trade-off function. Taken together, all the pieces show that the subsampled algorithm M ∘ Sample_p is min{f_p, f_p^{-1}}**-DP. Notably, the privacy bound min{f_p, f_p^{-1}}** is larger than f and cannot be improved in general. In light of the above, the f-DP framework is flexible enough to nicely handle the analysis of privacy amplification by subsampling. In the case where the original algorithm M is (ε, δ)-DP, this privacy bound strictly improves on the subsampling theorem for (ε, δ)-DP [36].
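The sampling scheme and the mixture f_p = pf + (1 − p)Id can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, trade-off functions are represented as plain Python callables on [0, 1], and the double conjugate min{f_p, f_p^{-1}}** is not computed here.

```python
import random
from typing import Callable, List


def poisson_subsample(n: int, p: float, rng: random.Random) -> List[int]:
    """Include each index 0..n-1 independently with probability p."""
    return [i for i in range(n) if rng.random() < p]


def mix_with_identity(f: Callable[[float], float],
                      p: float) -> Callable[[float], float]:
    """Return f_p = p * f + (1 - p) * Id, where Id(alpha) = 1 - alpha."""
    return lambda alpha: p * f(alpha) + (1.0 - p) * (1.0 - alpha)
```

Note that the subsample size is random: it equals pn only in expectation, which is why the mini-batch size is described as approximate.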

NoisySGD and NoisyAdam
SGD and Adam [32] are among the most popular optimizers in deep learning. Here we introduce a new privacy analysis of a private variant of SGD in the f -DP framework and then extend the study to a private version of Adam.
Letting S = {x_1, ..., x_n} denote the dataset, we consider minimizing the empirical risk
\[ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i), \]
where θ denotes the weights of the neural networks and ℓ(θ, x_i) is a loss function. At iteration t, a mini-batch I_t is selected from {1, 2, ..., n} with subsampling probability p, thereby having an approximate size of pn. Taking learning rate η_t and initial weights θ_0, the vanilla SGD updates the weights according to
\[ \theta_{t+1} = \theta_t - \frac{\eta_t}{|I_t|} \sum_{i \in I_t} \nabla_\theta \ell(\theta_t, x_i). \]
To preserve privacy, [13,53,9,3] introduce two modifications to the vanilla SGD. First, a clip step is applied to the gradient so that the gradient is in effect bounded. This step is necessary to have a finite sensitivity. The second modification is to add Gaussian noise to the clipped gradient, which is equivalent to applying the Gaussian mechanism to the updated iterates. Formally, the private SGD algorithm is described in Algorithm 1, which we henceforth refer to as NoisySGD and which uses Poisson subsampling. Herein I is the identity matrix and ‖·‖_2 denotes the ℓ_2 norm. For completeness, we remark that there are two other possible subsampling methods: shuffling (randomly permuting and dividing data into folds at each epoch) and uniform sampling (sampling a batch of size L from the whole data at each iteration). We emphasize that different subsampling mechanisms produce different privacy guarantees.
Algorithm 1 (NoisySGD). Parameters: initial weights θ_0, learning rate η_t, subsampling probability p, number of iterations T, noise scale σ, gradient norm bound R.
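Since only the parameter list of Algorithm 1 is reproduced here, the following is a hedged sketch of how the described steps (Poisson subsampling, per-example clipping to norm R, Gaussian noise with standard deviation σR) might fit together. It is not the authors' implementation: the function names are hypothetical, and the update is normalized by the expected batch size pn rather than |I_t|, an assumption made to avoid dividing by an empty batch.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def clip(v: Vector, R: float) -> Vector:
    """Rescale v so that its l2 norm is at most R (the clipping step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(1.0, norm / R) for x in v]


def noisy_sgd(data: Sequence[float],
              grad: Callable[[Vector, float], Vector],
              theta0: Vector, lr: float, p: float, T: int,
              sigma: float, R: float, rng: random.Random) -> Vector:
    """Illustrative NoisySGD loop with a constant learning rate."""
    theta = list(theta0)
    expected_batch = p * len(data)  # assumption: normalize by p * n
    for _ in range(T):
        # Poisson subsampling: each example joins the batch with probability p.
        batch = [x for x in data if rng.random() < p]
        # Sum of clipped per-example gradients; the sum has l2 sensitivity R.
        total = [0.0] * len(theta)
        for x in batch:
            for j, g in enumerate(clip(grad(theta, x), R)):
                total[j] += g
        # Gaussian mechanism: add N(0, (sigma * R)^2) noise to each coordinate.
        noisy = [s + rng.gauss(0.0, sigma * R) for s in total]
        theta = [t - lr * g / expected_batch for t, g in zip(theta, noisy)]
    return theta
```

For example, `noisy_sgd([1.0, 2.0, 3.0], lambda th, x: [th[0] - x], [0.0], 0.5, 1.0, 60, 0.0, 100.0, random.Random(0))` runs the loop on a toy quadratic loss; with σ = 0 and p = 1 it reduces to ordinary clipped gradient descent.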
The analysis of the overall privacy guarantee of NoisySGD makes heavy use of the compositional and subsampling properties of f-DP. We first focus on the privacy analysis of the step that computes θ_{t+1} from θ_t. Let M denote the gradient update and write Sample_p(S) for the mini-batch I_t (we drop the subscript t for simplicity). This allows us to use M ∘ Sample_p(S) to represent what NoisySGD does at each iteration. Next, note that adding or removing one individual would change the value of Σ_{i∈I_t} v̄_t^{(i)}, the sum of the clipped gradients, by at most R in the ℓ_2 norm due to the clipping operation; that is, Σ_{i∈I_t} v̄_t^{(i)} has sensitivity R. Consequently, the Gaussian mechanism with noise standard deviation σR ensures that M is 1/σ-GDP. Building on the subsampling and composition properties above, NoisySGD can be shown to be min{f, f^{-1}}**-DP with
\[ f = \bigl( p\, G_{1/\sigma} + (1 - p)\, \mathrm{Id} \bigr)^{\otimes T}. \]
To facilitate the use of this privacy bound, we now derive an analytically tractable approximation of min{f, f^{-1}}** using the privacy central limit theorem in a certain asymptotic regime, which further demonstrates the mathematical coherence and versatility of the f-DP framework. The central limit theorem shows that, in the asymptotic regime where p√T → ν for a constant ν > 0 as T → ∞,
\[ f = \bigl( p\, G_{1/\sigma} + (1 - p)\, \mathrm{Id} \bigr)^{\otimes T} \to G_\mu, \]
where µ = ν√(e^{1/σ²} − 1). Thus, the overall privacy loss in the form of the double conjugate satisfies
\[ \min\{f, f^{-1}\}^{**} \approx \min\{G_\mu, G_\mu^{-1}\}^{**} = G_\mu^{**} = G_\mu. \]
As such, the central limit theorem demonstrates that NoisySGD is approximately p√(T(e^{1/σ²} − 1))-GDP. Denoting by B = pn the mini-batch size, the privacy parameter p√(T(e^{1/σ²} − 1)) equals (B/n)√(T(e^{1/σ²} − 1)). Intuitively, this reveals that NoisySGD gives good privacy guarantees if B√T/n is small and σ is not too small.
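To make the closed-form bound concrete, here is a small sketch that evaluates µ = p√(T(e^{1/σ²} − 1)) and inverts it in σ for a target µ. The function names are hypothetical, and the values in the usage note (T = 10547, σ = 0.7) are illustrative assumptions rather than settings reported in this section; only p = 256/60000 is taken from the MNIST experiment.

```python
import math


def mu_clt(p: float, T: int, sigma: float) -> float:
    """CLT privacy parameter: mu = p * sqrt(T * (exp(1 / sigma^2) - 1))."""
    return p * math.sqrt(T * math.expm1(1.0 / sigma ** 2))


def sigma_for_target_mu(p: float, T: int, mu: float) -> float:
    """Invert mu_clt in sigma: sigma = 1 / sqrt(log(1 + mu^2 / (p^2 * T)))."""
    return 1.0 / math.sqrt(math.log1p(mu ** 2 / (p ** 2 * T)))
```

Calling `mu_clt(256 / 60000, 10547, 0.7)` gives a value slightly above 1.13, and `sigma_for_target_mu` recovers σ = 0.7 from that µ. The inverse form shows directly how a looser target µ permits a smaller noise scale for fixed p and T.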
As an aside, we remark that this new privacy analysis is different from the one performed in Section 5 of [17]. Therein, the authors consider Algorithm 1 with uniform subsampling and obtain a privacy bound that is different from the one in the present paper.
Next, we present a private version of Adam [32] in Algorithm 2, which we refer to as NoisyAdam and which can be found in [2]. This algorithm has the same privacy bound as NoisySGD in the f-DP framework. In short, this is because the momentum terms m_t and u_t are deterministic functions of the noisy gradients, so no additional privacy cost is incurred, by the post-processing property of differential privacy. In passing, we remark that the same argument applies to AdaGrad [19], and therefore it is also asymptotically GDP in the same asymptotic regime.
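The post-processing argument can be illustrated with a single Adam-style update that consumes an already-noised gradient. Algorithm 2 itself is not reproduced in this section, so the sketch below is an assumption-laden stand-in: the name `adam_step`, the default hyperparameters, and the omission of bias correction are all illustrative choices.

```python
import math
from typing import List, Tuple

Vector = List[float]


def adam_step(theta: Vector, m: Vector, u: Vector, noisy_grad: Vector,
              lr: float, beta1: float = 0.9, beta2: float = 0.999,
              xi: float = 1e-8) -> Tuple[Vector, Vector, Vector]:
    """One Adam-style update driven only by the noisy gradient."""
    # First- and second-moment estimates are deterministic functions of the
    # noisy gradient, so they are post-processing and cost no extra privacy.
    m_new = [beta1 * a + (1 - beta1) * g for a, g in zip(m, noisy_grad)]
    u_new = [beta2 * b + (1 - beta2) * g * g for b, g in zip(u, noisy_grad)]
    theta_new = [t - lr * a / (math.sqrt(b) + xi)
                 for t, a, b in zip(theta, m_new, u_new)]
    return theta_new, m_new, u_new
```

The function never touches the raw data: everything it needs is the privatized gradient, which is why the privacy accounting is identical to that of NoisySGD.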

Comparisons with the Moments Accountant
It is instructive to compare the moments accountant with our privacy analysis performed in Section 3.1 using the f-DP framework. Developed in [3], the moments accountant gives a tight one-to-one mapping between ε and δ for specifying the overall privacy loss in terms of (ε, δ)-DP under composition, which is beyond the reach of the advanced composition theorem [23]. With a slight abuse of notation, this paper uses functions δ_MA = δ_MA(ε) and ε_MA = ε_MA(δ) to denote the mapping induced by the moments accountant in both directions. For self-containedness, the appendix includes a formal description of the two functions.
Although NoisySGD and NoisyAdam are our primary focus, the following discussion applies to general iterative algorithms where composition must be addressed in the privacy analysis. Let algorithm M_t be f_t-DP for t = 1, ..., T and write M for their composition. On the one hand, the moments accountant technique ensures that M is (ε, δ_MA(ε))-DP for any ε or, put equivalently, is (ε_MA(δ), δ)-DP. On the other hand, the composition algorithm is f_1 ⊗ ··· ⊗ f_T-DP from the f-DP viewpoint and, following from the central limit theorem (2), this composition can be shown to be approximately GDP in a certain asymptotic regime. For example, both NoisySGD and NoisyAdam presented in Algorithm 1 and Algorithm 2, respectively, asymptotically satisfy µ_CLT-GDP with privacy parameter µ_CLT = p√(T(e^{1/σ²} − 1)).
In light of the above, it is tempting to ask which of the two approaches yields a sharper privacy analysis. In terms of f-DP guarantees, it must be the latter, which we refer to as the CLT approach, because the composition theorem of f-DP is tight (see the discussion following Theorem 3.2 in [17]) and, more importantly, the privacy central limit theorem is asymptotically exact. To formally state the result, note that the moments accountant asserts that the private optimizer is (ε, δ_MA(ε))-DP for all ε ≥ 0, which is equivalent to sup_{ε≥0} f_{ε,δ_MA(ε)}-DP by recognizing (1) (see also Proposition 2.11 in [17]). Roughly speaking, the following theorem says that sup_{ε≥0} f_{ε,δ_MA(ε)}-DP (asymptotically) promises no more privacy guarantees than the bound of µ_CLT-GDP given by the CLT approach; see Appendix B for a formal proof of this result.
Remark 1. For ease of reading, we point out that, in the (ε, δ)-DP framework, the smaller ε, δ are, the more privacy is guaranteed. In contrast, in the f-DP framework, the smaller f is, the less privacy is guaranteed.
Theorem 2 (Comparison in (ε, δ)-DP). Under the assumptions of Theorem 1, the f-DP framework gives an asymptotically sharper privacy analysis of both NoisySGD and NoisyAdam than the moments accountant in terms of (ε, δ)-DP. That is,
\[ \limsup_{T \to \infty} \bigl( \delta_{\mathrm{CLT}}(\varepsilon) - \delta_{\mathrm{MA}}(\varepsilon) \bigr) < 0 \]
for all ε ≥ 0.
In words, the CLT approach in the f-DP framework allows for an asymptotically smaller δ than the moments accountant at the same ε. It is worthwhile mentioning that the inequality in this theorem holds for any finite T if δ is derived by directly applying the duality to the (exact) privacy bound f_1 ⊗ ··· ⊗ f_T. Equivalently, the theorem says that limsup_{T→∞} (ε_CLT(δ) − ε_MA(δ)) < 0 for any δ, where we set ε_CLT(δ) = 0 for δ ≥ δ_CLT(0) and apply the same adjustment for ε_MA. As such, by setting the same δ in both approaches, say δ = 10^{-5}, the f-DP based CLT approach shall give a smaller value of ε.
8 See Section 2.4 of [17] for this result. See also [24,6].
9 Here, ε(δ; µ) is the inverse function of δ(ε; µ).
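For readers who want to reproduce the conversion from a µ-GDP guarantee to an (ε, δ) pair, the sketch below assumes the closed-form GDP duality from [17], δ(ε; µ) = Φ(−ε/µ + µ/2) − e^ε Φ(−ε/µ − µ/2), which is not displayed in this section, and inverts it by bisection. The function names and the search bound `hi` are hypothetical.

```python
import math


def _phi(x: float) -> float:
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def delta_from_eps(eps: float, mu: float) -> float:
    """delta(eps; mu) = Phi(-eps/mu + mu/2) - e^eps * Phi(-eps/mu - mu/2)."""
    return _phi(-eps / mu + mu / 2) - math.exp(eps) * _phi(-eps / mu - mu / 2)


def eps_from_delta(delta: float, mu: float, hi: float = 100.0) -> float:
    """Smallest eps with delta(eps; mu) <= delta, found by bisection."""
    if delta_from_eps(0.0, mu) <= delta:
        return 0.0  # same convention as above: eps is set to 0 for large delta
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if delta_from_eps(mid, mu) > delta:
            lo = mid
        else:
            hi = mid
    return hi
```

As a rough consistency check, `eps_from_delta(1e-5, 1.13)` lands close to the value ε = 5.07 quoted for the µ = 1.13 model in Section 4.1; the small gap is expected because µ is rounded to two decimals.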
Put differently, we can carefully adjust some parameters in Algorithm 1 and Algorithm 2 in order to let the algorithms be µ̃_CLT-GDP. For example, we can reduce the scale of the added noise from σ to a certain σ̃ < σ, which can be solved from (7) and
\[ \tilde{\mu}_{\mathrm{CLT}} = p \sqrt{T \bigl( e^{1/\tilde{\sigma}^2} - 1 \bigr)}. \tag{8} \]
Note that this is adapted from (5).
Figure 1: An illustration of the CLT approach in the f-DP framework and the moments accountant in the (ε, δ)-DP framework. NoisyOptimizer(σ, ...) using the moments accountant gives the same privacy guarantees in terms of (ε, δ)-DP as NoisyOptimizer(σ̃, ...) using the CLT approach (the ellipses denote omitted parameters). Note that the duality formula (6) is used in solving µ̃_CLT from (7).
Figure 1 shows the flowchart of the privacy analyses using the two approaches and their relationship. In addition, numerical comparisons are presented in Figure 2, consistently demonstrating the superiority of the CLT approach.

Results
In this section, we use NoisySGD and NoisyAdam to train private deep learning models on datasets for tasks ranging from image classification (MNIST), text classification (IMDb movie review), recommender systems (MovieLens movie rating), to regular binary classification (Adult income). Note that these datasets all contain sensitive information about individuals, and this fact necessitates privacy consideration in the training process. A git repository with code to reproduce the results is available at https://github.com/woodyx218/Deep-Learning-with-GDP.

The f -DP Perspective
This section demonstrates the utility and practicality of the private deep learning methodologies with associated privacy guarantees in terms of f -DP. In Section 4.2, we extend the empirical study to the (ε, δ)-DP framework.
MNIST. The MNIST dataset [34] contains 60,000 training images and 10,000 test images. Each image is 28 × 28 in gray-scale and represents a handwritten digit ranging from 0 to 9. We train neural networks with the same architecture (two convolutional layers followed by one dense layer) as in [2,3] on this dataset. Throughout the experiment, we set the subsampling probability to p = 256/60000 and use a constant learning rate η. Table 1 displays the test accuracy of the neural networks trained by NoisySGD as well as the associated privacy analyses. The privacy parameters ε in the last two columns are both with respect to δ = 10^{-5}. Over all six sets of experiments with different tuning parameters, the CLT approach gives a significantly smaller value of ε than the moments accountant, which is consistent with our discussion in Section 3.2. The point we wish to emphasize, however, is that f-DP offers a much more comprehensive interpretation of the privacy guarantees than (ε, δ)-DP. For instance, the model from the third row preserves a decent amount of privacy since it is not always easy to tell apart N(0, 1) and N(1.13, 1). In stark contrast, the (ε, δ)-DP viewpoint is too conservative, suggesting that for the same model not much privacy is left, due to a very large "likelihood ratio" e^ε in Definition 2.1: it equals e^{7.10} = 1212.0 or e^{5.07} = 159.1 depending on which approach is chosen. This shortcoming of (ε, δ)-DP cannot be overcome by taking a larger δ, which, although it gives rise to a smaller ε, would undermine the privacy guarantee from a different perspective.
Table 1: Experimental results for NoisySGD and their privacy analyses on MNIST. The accuracy is averaged over 10 independent runs. The hyperparameters in the first three rows are the same as in [2]. The µ in the 6th row is calculated using (5), which carries over to the 7th row via (6) with δ = 10^{-5}. The number of epochs is equal to T × mini-batch size/n = pT.
For all experiments described in Table 1, Figure 3 illustrates the privacy bounds given by the CLT approach and the moments accountant, both in terms of trade-off functions. The six plots in the first and third rows are with respect to δ = 10^{-5}, from which the f-DP framework is seen to provide an analyst with substantial improvements in the privacy bounds. Concretely, for the model corresponding to 96.6% test accuracy, the minimum sum of type I and type II errors in the sense of hypothesis testing is (at least) 77.6% by the CLT approach, whereas it is merely (at least) 9.4% by the moments accountant. For completeness, we show the optimal trade-off functions over all pairs of ε, δ given by the moments accountant in the middle row. The gaps between the two approaches exist, as predicted by Theorem 1, and remain significant.
Next, we extend our experiments to other datasets to further test f -DP for training private neural networks. The experiments compare private models under the privacy budget µ ≤ 2 to their non-private counterparts and some popular baseline methods. For simplicity, we focus on shallow neural networks and leave the investigation of complex architectures for future research.
Adult income. Originally from the UCI repository [18], the Adult income dataset has been preprocessed into the LIBSVM format [12]. Our model is a multilayer perceptron with a single hidden layer of 16 neurons and the ReLU activation. We set σ = 0.55, p = 256/29305, η = 0.15, R = 1, and use NoisySGD as our optimizer. The results displayed in Table 2 show that our private model achieves comparable performance to the baselines in the MLC++ library [33] in terms of test accuracy.

Figure 3: ... Table 1. The plots are different from Figure 7 in [17]. The (ε, δ)-DP guarantees are plotted according to (1). The blue regions in the plots from the second row correspond to all pairs of (ε, δ) computed by MA. The blue regions are not noticeable in the third row.
IMDb. We use the IMDb movie review dataset [38] for binary sentiment classification (positive or negative reviews). The dataset contains 25,000 training and 25,000 test examples. In our experiments, we preprocess the dataset by keeping only the 10,000 most frequently used words and discarding the rest. Next, we set every example to have 256 words by truncating it or padding it with zeros if necessary.
In our neural networks, the input is first embedded into 16 units and then is passed through a

Table 3: Results for NoisyAdam on the IMDb dataset, with $\delta = 10^{-5}$ used in the privacy analyses.
MovieLens. The MovieLens movie rating dataset [27] is a benchmark dataset for recommendation tasks. Our experiments consider the MovieLens 1M dataset, which contains 1,000,209 movie ratings from 1 star to 5 stars. In total, there are 6,040 users who rated 3,706 different movies. For this multi-class classification problem, the root mean squared error (RMSE) is chosen as the performance measure. It is worthwhile to mention that, as each user only watched a small fraction of all the movies, most (user, movie) pairs correspond to missing ratings. We randomly sample 20% of the examples as the test set and take the remainder as the training set. Our model is a simplified version of the neural collaborative filtering in [28]. The network architecture consists of two branches. The left branch applies generalized matrix factorization to embed the users and movies using five latent factors. The output of the user embedding is multiplied by the item embedding. In the right branch, we use 10 latent factors for embedding. The embeddings from both branches are then concatenated and fed to a fully-connected output layer. We set σ = 0.6, p = 1/80, η = 0.01, and R = 5 in NoisyAdam. Table 4 presents the numerical results of our neural networks as well as baseline models in the Surprise library [30]; the private model still outperforms many popular non-private models, including the user-based collaborative filtering and nonnegative matrix factorization.

Table 4: Results for NoisyAdam on the MovieLens 1M dataset, with $\delta = 10^{-6}$ used in the privacy analyses. CF stands for collaborative filtering.

The (ε, δ)-DP Perspective
While we hope that the f -DP perspective has been conclusively demonstrated to be advantageous, this section shows that the CLT approach continues to bring considerable benefits even in terms of (ε, δ)-DP. Specifically, by making use of the comparisons between the CLT approach and the moments accountant in Section 3.2, we can add less noise to the gradients in NoisySGD and NoisyAdam while achieving the same (ε, δ)-DP guarantees provided by the moments accountant.
With less added noise, conceivably, an optimizer would have a higher prediction accuracy. Figure 4 illustrates the experimental results on MNIST. In the top two plots, we set the noise scales to σ = 1.3 and σ = 1.06, which are shown to give $(1.34, 10^{-5})$-DP at epoch 20 using the moments accountant and the CLT approach, respectively. The test accuracy associated with the CLT approach is almost always higher than that associated with the moments accountant. Another benefit of taking the CLT approach is that it gives rise to stronger privacy protection before reaching epoch 20, as shown by the right plot. For the bottom plots, although the improvement in test accuracy at the end of training is less significant, the CLT approach leads to much faster convergence at early epochs. To be concrete, the numbers of epochs needed to achieve 95%, 96%, and 97% test accuracy are 18, 26, and 45, respectively, for the neural networks with less noise, whereas the numbers of epochs are 23, 33, and 64, respectively, using the noise level that is computed by the moments accountant. In a similar vein, the moments accountant gives a test accuracy of 92% for the first time when ε = 4, whereas the CLT approach achieves 96% under the same privacy budget.
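The noise calibration described above can be sketched as a one-dimensional search. The code below is an illustration under stated assumptions rather than the authors' implementation: it takes the CLT parameter to be µ = p·sqrt(T(e^{1/σ²} − 1)), converts µ-GDP to (ε, δ)-DP through the duality δ(ε) = Φ(−ε/µ + µ/2) − e^ε Φ(−ε/µ − µ/2) from [17], and bisects over σ. All function names and the search interval are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def clt_mu(sigma: float, p: float, steps: int) -> float:
    """GDP parameter predicted by the CLT (assumed formula, see lead-in)."""
    return p * math.sqrt(steps * math.expm1(1.0 / sigma ** 2))


def delta_from_gdp(eps: float, mu: float) -> float:
    """delta(eps) implied by mu-GDP under the assumed duality formula."""
    return (normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0))


def calibrate_sigma(eps, delta, p, steps, lo=0.3, hi=20.0):
    """Approximate the smallest sigma in [lo, hi] meeting (eps, delta)-DP."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if delta_from_gdp(eps, clt_mu(mid, p, steps)) > delta:
            lo = mid  # too little noise: the target is violated
        else:
            hi = mid  # target met: try less noise
    return hi


p = 256 / 60000
steps = round(20 / p)  # 20 epochs of MNIST training
sigma = calibrate_sigma(1.34, 1e-5, p, steps)
print(f"sigma from the CLT for (1.34, 1e-5)-DP at epoch 20: {sigma:.3f}")
```

Under these assumptions the search lands close to the σ = 1.06 used in the top plots, noticeably below the σ = 1.3 required by the moments accountant for the same guarantee.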

Discussion
In this paper, we have showcased the use of f-DP, a very recently proposed privacy definition, for training private deep learning models using SGD or Adam. Owing to its strength in handling composition and subsampling and the powerful privacy central limit theorem, the f-DP framework allows for a closed-form privacy bound that is sharper than the one given by the moments accountant in the (ε, δ)-DP framework. By numerical experiments, we show that the trained neural networks can be quite private from the f-DP viewpoint (for instance, 1.13-GDP) but are not in the (ε, δ)-DP sense due to overly conservative privacy bounds (for instance, $(7.10, 10^{-5})$-DP) computed in the (ε, δ)-DP framework. This in turn suggests that one can add less noise during the training process while having the same privacy guarantees as using the moments accountant, thereby improving model utility.

We conclude this paper by offering several directions for future research. As the first direction, we may consider using time-dependent noise scales and learning rates in NoisySGD and NoisyAdam for a better tradeoff between privacy loss and utility in the f-DP framework. Note that [35] has made considerable progress using concentrated differential privacy along this line. More generally, a straightforward but interesting problem is to extend this work to complex neural network architectures with a variety of optimization strategies. For example, can we develop some guidelines for choosing an optimizer among NoisySGD, NoisyAdam, and others for a given classification problem under some privacy constraint? Empirically, the test accuracy of deep learning models is very sensitive to hyperparameters such as the mini-batch size. Therefore, from a practical standpoint, it would be of great importance to incorporate hyperparameter tuning into the f-DP framework [26].
Given f -DP's good interpretability and powerful toolbox, it is worthwhile investigating whether, from a broad perspective, its superiority over earlier differential privacy relaxations would hold in general private statistical and machine learning tasks. We look forward to more research efforts to further the theory and extend the use of f -DP.

A Omitted Details in Section 2
We present Equation (3) as the following proposition, which is given in Section 2 but not in the foundational work [17].
Proof. We first write the two distributions $M \circ \mathrm{Sample}_p(S)$ and $M \circ \mathrm{Sample}_p(S')$ as mixtures.

Without loss of generality, we can assume $S = \{x_1, \ldots, x_n\}$ and $S' = \{x_0, x_1, \ldots, x_n\}$. An outcome of the process $\mathrm{Sample}_p$ when applied to $S$ is a bit string $b = (b_1, \ldots, b_n) \in \{0, 1\}^n$. Bit $b_i$ indicates whether $x_i$ is selected into the subsample. We use $S_b \subseteq S$ to denote the subsample determined by $b$. When each $b_i$ is sampled from a Bernoulli($p$) distribution independently, $S_b$ can be identified with $\mathrm{Sample}_p(S)$. Let $\theta_b$ be the probability that $b$ appears. More specifically, if $k$ out of the $n$ entries of $b$ are one, then $\theta_b = p^k (1-p)^{n-k}$. With this notation, $M \circ \mathrm{Sample}_p(S)$ can be written as the following mixture:

$$M \circ \mathrm{Sample}_p(S) = \sum_{b \in \{0,1\}^n} \theta_b \cdot M(S_b).$$

Similarly, $M \circ \mathrm{Sample}_p(S')$ can also be written as a mixture, with an additional bit indicating the presence of $x_0$. Alternatively, we can divide the components into two groups: one with $x_0$ present, and the other with $x_0$ absent. Namely,

$$M \circ \mathrm{Sample}_p(S') = \sum_{b \in \{0,1\}^n} p\,\theta_b \cdot M(S_b \cup \{x_0\}) + \sum_{b \in \{0,1\}^n} (1-p)\,\theta_b \cdot M(S_b).$$

Note that $S_b \cup \{x_0\}$ and $S_b$ are neighbors, i.e., $M \circ \mathrm{Sample}_p(S')$ is a mixture of neighboring distributions. The following lemma is the perfect tool to deal with it.
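The mixture weights in this argument can be enumerated directly for a small dataset. The sketch below is only an illustration (the function names are hypothetical): it lists every inclusion pattern b with its probability θ_b, then splits the weights for the larger dataset according to whether x_0 is drawn, and checks that both sets of weights sum to one.

```python
from itertools import product


def subsample_weights(n, p):
    """Map each inclusion pattern b in {0,1}^n to theta_b = p^k (1-p)^(n-k)."""
    weights = {}
    for b in product((0, 1), repeat=n):
        k = sum(b)  # number of selected points
        weights[b] = p ** k * (1 - p) ** (n - k)
    return weights


def weights_with_extra_point(n, p):
    """Weights for S' = S ∪ {x0}; the leading bit records whether x0 is drawn."""
    out = {}
    for b, theta in subsample_weights(n, p).items():
        out[(1,) + b] = p * theta        # x0 present: component M(S_b ∪ {x0})
        out[(0,) + b] = (1 - p) * theta  # x0 absent:  component M(S_b)
    return out


theta = subsample_weights(4, 0.3)
theta_prime = weights_with_extra_point(4, 0.3)
print(sum(theta.values()), sum(theta_prime.values()))
```

The total weight of the components containing x_0 equals p, which is exactly the mixing proportion that appears in pf + (1 − p)Id.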
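Lemma A.2. Let $I$ be an index set. For all $i \in I$, $P_i$ and $Q_i$ are distributions that reside on a common sample space, and $(\theta_i)_{i \in I}$ is a collection of non-negative numbers that sums to 1. If $f$ is a trade-off function and $T(P_i, Q_i) \geq f$ for all $i$, then

$$T\Big(\sum_i \theta_i P_i,\; (1-p)\sum_i \theta_i P_i + p\sum_i \theta_i Q_i\Big) \geq pf + (1-p)\mathrm{Id}.$$

To apply the lemma, let the index be $b \in \{0, 1\}^n$, $P_i$ be $M(S_b)$ and $Q_i$ be $M(S_b \cup \{x_0\})$. The condition $T(P_i, Q_i) \geq f$ is a consequence of $M$ being $f$-DP. The conclusion simply translates to

$$T\big(M \circ \mathrm{Sample}_p(S),\; M \circ \mathrm{Sample}_p(S')\big) \geq pf + (1-p)\mathrm{Id},$$

which is what we want. The proof is complete.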
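To make the bound pf + (1 − p)Id concrete, the following sketch evaluates it when f is a Gaussian trade-off function. It assumes the definitions $G_\mu(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$ and $\mathrm{Id}(\alpha) = 1 - \alpha$ from [17]; the function names are illustrative only.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), valid for 0 < alpha < 1."""
    return _STD.cdf(_STD.inv_cdf(1.0 - alpha) - mu)


def identity_tradeoff(alpha: float) -> float:
    """Id(alpha) = 1 - alpha, the trade-off function of perfect privacy."""
    return 1.0 - alpha


def subsampled_tradeoff(alpha: float, mu: float, p: float) -> float:
    """f_p(alpha) = p * G_mu(alpha) + (1 - p) * Id(alpha)."""
    return p * gaussian_tradeoff(alpha, mu) + (1.0 - p) * identity_tradeoff(alpha)


for alpha in (0.1, 0.5, 0.9):
    print(alpha, gaussian_tradeoff(alpha, 1.0), subsampled_tradeoff(alpha, 1.0, 0.1))
```

For every α the subsampled curve lies between $G_\mu$ and Id, which reflects the privacy amplification obtained from subsampling.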
It is easy to see that

Since f is convex, Jensen's inequality implies

Next we use a figure to justify the claim we made in Section 2.2 that "CLT approximation works well for SGD". Recall that we argued in Section 3 that Algorithms 1 and 2 are $\min\{f, f^{-1}\}^{**}$-DP with $f = (pG_{1/\sigma} + (1-p)\mathrm{Id})^{\otimes T}$. This function converges to $G_\mu$ with $\mu = \nu\sqrt{e^{1/\sigma^2} - 1}$ as $T \to \infty$ provided $p\sqrt{T} \to \nu$. In the following figure, we numerically compute $f$ (blue dashed) and compare it with the predicted limit $G_\mu$ (red solid). More specifically, the configuration is designed to illustrate the fast convergence in the setting of the second line of Table 1, i.e., noise scale σ = 1.1, final GDP parameter µ = 0.57 and test accuracy 96.6%. Originally the algorithm runs 60 epochs, i.e., ≈ 14k iterations. To best illustrate that convergence appears at an early stage, the numerical evaluation uses a much smaller $T_{\mathrm{numeric}} = 234$, i.e., only one epoch. In order to make the final limit consistent, we also enlarge the sampling probability to $p_{\mathrm{numeric}}$ so that $p_{\mathrm{numeric}} \cdot \sqrt{T_{\mathrm{numeric}}}$ remains the same. We have to remark that when σ is small, $\mu = \nu\sqrt{e^{1/\sigma^2} - 1}$ gets large and poses challenges in the numerical computation of $(pG_{1/\sigma} + (1-p)\mathrm{Id})^{\otimes T}$. We leave a rigorous and complete study to future work.

Note that we cannot say $M$ is $f_p$-DP because $T(M(S'), M(S))$ is not necessarily lower bounded by $f_p$. So we need a more specific composition theorem than stated in [17].

Then the composition

The theorem can be proved in the same way as Theorem 3.2 in [17]. Taking Algorithm 1 as an example, since NoisySGD is simply the composition of $T$ copies of $M$, the above composition theorem implies that $T(\mathrm{NoisySGD}(S), \mathrm{NoisySGD}(S')) \geq f_p^{\otimes T} = f$. Moreover, swapping the roles of the two datasets, $T(\mathrm{NoisySGD}(S'), \mathrm{NoisySGD}(S)) \geq f^{-1}$. The two inequalities let us conclude that any trade-off function of neighboring distributions must be lower bounded by at least one of $f$ and $f^{-1}$, hence by $\min\{f, f^{-1}\}$, hence by $\min\{f, f^{-1}\}^{**}$. In other words, NoisySGD is $\min\{f, f^{-1}\}^{**}$-DP.
For NoisyAdam, we argued that its privacy property is the same as NoisySGD in each iteration, so the above argument also applies, and we have the same conclusion.

B.2 Justifying CLT for Algorithms 1 and 2
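The main purpose of this section is to show the following theorem.

Theorem 5. Suppose $p$ depends on $T$ and $p\sqrt{T} \to \nu$. Then we have the following uniform convergence as $T \to \infty$:

$$\big(pG_{1/\sigma} + (1-p)\mathrm{Id}\big)^{\otimes T} \to G_\mu,$$

where $\mu = \nu \cdot \sqrt{e^{1/\sigma^2} - 1}$.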
This theorem is a corollary of the following more general CLT on the composition of subsampled mechanisms and Lemma B.1 below.
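As a numerical illustration of the limiting parameter, the sketch below evaluates µ = ν·sqrt(e^{1/σ²} − 1) with ν = p·sqrt(T) for the MNIST setting with σ = 1.1 and 60 epochs, and then rescales p so that a single epoch (T = 234) has the same ν. The variable and function names are hypothetical, and the formula for µ is taken as an assumption from the surrounding text.

```python
import math


def clt_gdp_parameter(sigma: float, p: float, num_steps: float) -> float:
    """mu = nu * sqrt(exp(1 / sigma^2) - 1) with nu = p * sqrt(num_steps)."""
    nu = p * math.sqrt(num_steps)
    return nu * math.sqrt(math.expm1(1.0 / sigma ** 2))


n, batch_size, epochs = 60000, 256, 60
p = batch_size / n
num_steps = epochs * n / batch_size      # about 14k iterations
mu = clt_gdp_parameter(1.1, p, num_steps)

# Keep nu = p * sqrt(T) fixed while shrinking T to one epoch.
t_numeric = 234
p_numeric = p * math.sqrt(num_steps / t_numeric)
mu_numeric = clt_gdp_parameter(1.1, p_numeric, t_numeric)

print(f"mu = {mu:.3f}, p_numeric = {p_numeric:.4f}, mu_numeric = {mu_numeric:.3f}")
```

The value of µ comes out close to the 0.57 reported for this configuration, and by construction the rescaled setting has the same limit.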
Theorem 6. Suppose $f$ is a trade-off function such that (1) $f(0) = 1$, (2) $f(x) > 0$ for all $x < 1$, and (3) $\int_0^1 (f'(x) + 1)^4\,dx < +\infty$. Let $f_p = pf + (1-p)\mathrm{Id}$ as usual. Furthermore, assume $p\sqrt{T} \to \nu$ as $T \to \infty$ for some constant $\nu > 0$. Then we have the uniform convergence

$$f_p^{\otimes T} \to G_{\nu\sqrt{\chi^2(f)}}.$$

Lemma B.1. We have $\chi^2(G_{1/\sigma}) = e^{1/\sigma^2} - 1$.

In order to prove Theorem 6, we need an even more general CLT. The first privacy CLT was introduced in [17]. However, that version is valid only when each component trade-off function is symmetric, which is not true for $pG_{1/\sigma} + (1-p)\mathrm{Id}$. In order to state the general CLT that applies to asymmetric trade-off functions, we need to introduce the following functionals:

Theorem 7. Let $\{f_{ni} : 1 \leq i \leq n\}_{n=1}^{\infty}$ be a triangular array of (possibly asymmetric) trade-off functions and assume the following limits for some constants $K \geq 0$ and $s > 0$ as $n \to \infty$:

uniformly for all $\alpha \in [0, 1]$.
The proof of this theorem closely mimics that of Theorem 3.5 in [17]; we omit it here because of its length.
Next, we apply the asymmetric CLT to $(pf + (1-p)\mathrm{Id})^{\otimes T}$ and prove Theorem 6. We start by collecting the necessary expressions into the following lemma. All of them are straightforward.

Then

Proof of Theorem 6. It suffices to compute the limits in the asymmetric central limit theorem (Theorem 7), namely

Since $T \sim p^{-2}$, we can consider $p^{-2}\big(\mathrm{kl}(f_p) + \widetilde{\mathrm{kl}}(f_p)\big)$ and so on.
The assumption expressed in terms of $g$ is simply $\int_0^1 g(x)^4\,dx < +\infty$.

In particular, it implies that $|g(x)|^k$ is integrable on $[0, 1]$ for $k = 2, 3, 4$. In addition, by Lemma B.2,

For the functional kl, by Lemma B.2,

Changing the order of the limit and the integral in (9) is justified by the dominated convergence theorem. To see this, notice that $\log(1 + x) \leq x$. The integrand in (9) satisfies

$$0 \leq g(x) \cdot \frac{1}{p}\log\big(1 + p\,g(x)\big) \leq g(x)^2.$$

We already argued that $g(x)^2$ is integrable, so it works as a dominating function and the limit is justified. When $p\sqrt{T} \to \nu$, we have

So the constant $K$ in Theorem 7 is $\nu^2 \cdot \chi^2(f)$. For the functional $\kappa_2$ we have

By a similar dominating function argument,

Adding in the limit $p\sqrt{T} \to \nu$, we know $s^2$ in Theorem 7 is $\nu^2 \cdot \chi^2(f)$. The same argument involving $g(x)^3$ and $g(x)^4$ applies to the functionals $\kappa_3$ and $\tilde{\kappa}_3$ respectively and yields

Note the different power of $p$ in the denominator. It means $\kappa_3(f_p) = o(p^2)$ and hence $T \cdot \kappa_3(f_p) \to 0$ when $p\sqrt{T} \to \nu$. Hence all the limits in Theorem 7 check out and we have a $G_\mu$ limit where $\mu = K/s = s = \sqrt{\nu^2 \cdot \chi^2(f)} = \nu \cdot \sqrt{\chi^2(f)}$.
This completes the proof.
We finish the section by proving the formula in Lemma B.1.
Proof of Lemma B.1. The best calculation is done via better understanding. We point out that the functional $\chi^2$ is doing nothing more than computing the famous $\chi^2$-divergence. Recall that the Neyman $\chi^2$-divergence (reverse Pearson) of $P, Q$ is defined as

$$\chi^2(P \| Q) := \mathbb{E}_P\Big[\Big(\frac{dQ}{dP} - 1\Big)^2\Big].$$

Lemma B.3. If $f = T(P, Q)$ and $f(0) = 1$, $f(x) > 0$ for all $x < 1$, then $\chi^2(f) = \chi^2(P \| Q)$.

This lemma is a straightforward corollary of Proposition B.4 in [17], which gives expressions for all $F$-divergences (we use capital $F$ to avoid confusion with the notation of trade-off functions). In particular, if $f = T(P, Q)$ and $f(0) = 1$, $f(x) > 0$ for all $x < 1$, then the $F$-divergence of $P, Q$ can be computed from their trade-off function as follows:

$$D_F(P \| Q) = \int_0^1 F\big(|f'(x)|^{-1}\big) \cdot |f'(x)|\,dx.$$
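The Gaussian case can be sanity-checked numerically. The sketch below assumes the Neyman form $\mathbb{E}_P[(dQ/dP - 1)^2]$ with P = N(0, 1) and Q = N(µ, 1), and compares a trapezoid-rule estimate with the closed form $e^{\mu^2} - 1$, which is a standard fact for two unit-variance Gaussians. The integration range and step count are arbitrary choices made for illustration.

```python
import math


def normal_pdf(x: float, mean: float = 0.0) -> float:
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def chi2_gaussian_numeric(mu, lo=-15.0, hi=20.0, n=35000):
    """Trapezoid estimate of the integral of (q - p)^2 / p over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        p_x = normal_pdf(x)
        q_x = normal_pdf(x, mu)
        weight = 0.5 if i in (0, n) else 1.0  # endpoints get half weight
        total += weight * (q_x - p_x) ** 2 / p_x
    return total * h


mu = 1.0 / 1.1  # corresponds to G_{1/sigma} with sigma = 1.1
numeric = chi2_gaussian_numeric(mu)
closed_form = math.expm1(mu ** 2)
print(numeric, closed_form)
```

The two numbers agree to many digits, consistent with the value $e^{1/\sigma^2} - 1$ that enters the limiting GDP parameter.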

B.3 Proof of Theorems 1 and 2
Recall that Theorems 1 and 2 compare our CLT approach to the moments accountant (MA) from two different perspectives: the f-DP perspective in Theorem 1 and the (ε, δ)-DP perspective in Theorem 2. We first show that Theorem 1 can be derived from Theorem 2. Then we prove a refined version of Theorem 2.
To be more precise about the statement, let us first expand the notation used in the main text. Let $\delta_{\mathrm{MA}}(\varepsilon; \sigma, p, T)$ be the δ value computed by the moments accountant method (described in detail below) for the NoisySGD algorithm with subsampling probability $p$, number of iterations $T$ and noise scale σ. Similarly, $\delta_{\mathrm{CLT}}(\varepsilon; \sigma, \nu)$ denotes the δ value computed for the same algorithm using the central limit theorem assuming $p\sqrt{T} \to \nu$. Let $f_T(\alpha) = \sup_{\varepsilon \geq 0} f_{\varepsilon, \delta_{\mathrm{MA}}(\varepsilon)}(\alpha)$. It is supported by $f_{\varepsilon_T, \delta_{\mathrm{MA}}(\varepsilon_T)}$ at α. Theorem 2 says this supporting function is smaller than that of $G_{\mu_{\mathrm{CLT}}}$ at α by a strict gap. Taking the limit, $\limsup_{T \to \infty} f_T(\alpha)$ has at least that much gap from $G_{\mu_{\mathrm{CLT}}}(\alpha)$, which proves Theorem 1.
Theorem 2 is a straightforward corollary of the following proposition. Note that the inequality is reversed compared to the statement of Theorem 2 so that the gap is positive, which also turns lim sup into lim inf.

where $\mu = \nu \cdot \sqrt{e^{1/\sigma^2} - 1}$.