
Private Prediction Sets

Published on Apr 28, 2022

Abstract

In real-world settings involving consequential decision-making, the deployment of machine learning systems generally requires both reliable uncertainty quantification and protection of individuals’ privacy. We present a framework that treats these two desiderata jointly. Our framework is based on conformal prediction, a methodology that augments predictive models to return prediction sets that provide uncertainty quantification—they provably cover the true response with a user-specified probability, such as 90%. One might hope that when used with privately trained models, conformal prediction would yield privacy guarantees for the resulting prediction sets; unfortunately this is not the case. To remedy this key problem, we develop a method that takes any pretrained predictive model and outputs differentially private prediction sets. Our method follows the general approach of split conformal prediction; we use holdout data to calibrate the size of the prediction sets but preserve privacy by using a privatized quantile subroutine. This subroutine compensates for the noise introduced to preserve privacy in order to guarantee correct coverage. We evaluate the method on large-scale computer vision data sets.

Keywords: privacy, uncertainty quantification, conformal prediction, prediction sets


1. Introduction

The impressive predictive accuracies of black-box machine learning algorithms on tightly controlled test beds do not sanctify their use in consequential applications. For example, given the gravity of medical decision-making, automated diagnostic predictions must come with rigorous instance-wise uncertainty quantification to avoid silent, high-consequence failures. Furthermore, medical data science requires privacy guarantees, since individuals would suffer material harm were their data to be accessed or reconstructed by a nefarious actor. While uncertainty quantification and privacy are generally dealt with in isolation, they arise together in many real-world predictive systems, and, as we discuss, they interact. Accordingly, the work that we present here involves a framework that addresses uncertainty and privacy jointly. Specifically, we develop a differentially private version of conformal prediction that results in private, rigorous, finite-sample uncertainty quantification for any model and any data set at little computational cost.

Figure 1. Modular data science.

This work takes a modular viewpoint on data science: we seek to give practitioners the flexibility to train whatever underlying private model gives the best performance (e.g., via deep learning), then later endow the model with rigorous statistical properties, without modifying that underlying model. See Figure 1. Zooming out, perhaps the most consistent trend in the history of engineering is that modular engineering building blocks—like integrated circuits or reaction chemistry—are key to scalable and deployable systems where each part can be improved and debugged separately. Private prediction sets allow simple, private uncertainty quantification for any system and thus provide conceptual building blocks for data scientists constructing such systems.

Turning to the details, our approach builds on the notion of prediction sets—subsets of the response space that provably cover the true response variable with prespecified probability (e.g., $90\%$). Formally, for a test point with feature vector $X \in \mathcal{X}$ and response $Y \in \mathcal{Y}$, we compute an uncertainty set function, $\mathcal{C}(\cdot)$, mapping a feature vector to a subset of $\mathcal{Y}$ such that

$$\mathbb{P}\left\{Y \in \mathcal{C}(X)\right\} \ge 1 - \alpha, \qquad (1)$$

for a user-specified confidence level $1-\alpha$, where $\alpha \in (0,1)$. We use the output of an underlying predictive model (e.g., a pretrained, privatized neural network) along with a held-out calibration data set, $\{(X_i,Y_i)\}_{i=1}^n$, from the same distribution as $(X,Y)$ to fit the set-valued function $\mathcal{C}(\cdot)$. The probability in expression (1) is therefore taken over both the randomness in $(X,Y)$ and $\{(X_i,Y_i)\}_{i=1}^n$. If the underlying model expresses uncertainty, $\mathcal{C}$ will be large, signaling skepticism regarding the model’s prediction.

Moreover, we introduce a differentially private mechanism for fitting $\mathcal{C}$, such that the sets that we compute have low sensitivity to the removal of any calibration point. This will allow an individual to contribute a calibration data point without fear that the prediction sets will reveal their sensitive information. Note that even if the underlying model is trained in a privacy-preserving fashion, this provides no privacy guarantee for the calibration data. Therefore, we will provide an adjustment that masks the calibration data set with additional randomness, addressing both privacy and uncertainty simultaneously.

Figure 2. Examples of private conformal prediction sets on COVID-19 data. We show three examples of lung X-rays taken from the CoronaHack data set (Perez et al., 2020) with their corresponding private prediction sets at α = 10% from a ResNet-18. All three patients had viral pneumonia (likely COVID-19). The classes in the prediction sets appear in ranked order according to the softmax score of the model; the center and right images are incorrectly classified if the predictor returns only the most likely class, but are correctly covered by the private prediction sets. See Experiment 4.4 for details.

See Figure 2 for a concrete example of private prediction sets applied to the automated diagnosis of COVID-19. In this setting, the prediction sets represent a set of plausible diagnoses based on an X-ray image—either viral pneumonia (presumed COVID-19), bacterial pneumonia, or normal. We guarantee that the true diagnosis is contained in the prediction set with high probability, while simultaneously ensuring that an adversary cannot detect the presence of any one of the X-ray images used to train the predictive system.

1.1. Our contribution

Our main contribution is a privacy-preserving algorithm that takes as input any predictive model together with a calibration data set, and outputs a set-valued function $\mathcal{C}(\cdot)$ that maps any input feature vector $X$ to a set of labels such that the true label $Y$ is contained in the predicted set with probability at least $1-\alpha$, as per Equation (1). In order to generate prediction sets satisfying this property, we use ideas from split conformal prediction (J. Lei et al., 2018; Papadopoulos et al., 2002; Vovk et al., 2005), modifying this approach to ensure privacy. Importantly, if the provided predictive model is also trained in a differentially private way, then the whole pipeline that maps data to a prediction set function $\mathcal{C}(\cdot)$ is differentially private as well.

In Algorithm 1, we sketch our main procedure.

Algorithm 1. Private prediction sets (informal)

input: predictor $\hat{f}(\cdot)$, calibration data $\{(X_i,Y_i)\}_{i=1}^n$, privacy level $\epsilon>0$, confidence level $\alpha\in(0,1)$

For $1\leq i\leq n$, compute conformity score $s_i = S_{\hat f}(X_i,Y_i)$

Compute $\epsilon$-differentially private $(1-\alpha + O((n\epsilon)^{-1}))$-quantile of $\{s_i\}_{i=1}^n$, denoted $\hat s$

output: $\mathcal{C}(\cdot) = \{y : S_{\hat f}(\cdot,y)\leq \hat s\}$

Algorithm 1 first computes the conformity scores for all calibration samples. Informally, these scores indicate how well a feature–label pair ‘conforms’ to the provided model $\hat{f}$: a low score implies high conformity, while a high score indicates an atypical point from the perspective of $\hat{f}$. Then, the algorithm computes a carefully chosen private quantile of the scores. Finally, it returns a prediction set function $\mathcal{C}(\cdot)$ which, for a given input feature vector, returns all labels that result in a conformity score below the critical threshold $\hat s$.

Our main theoretical result asserts that Algorithm 1 has strict coverage guarantees and is differentially private. In addition, we show that the coverage is almost tight, that is, not much higher than $1-\alpha$.

Theorem 1 (Informal preview). The prediction set function $\mathcal{C}(\cdot)$ returned by Algorithm 1 is $\epsilon$-differentially private and satisfies

$$1 - \alpha \leq \mathbb{P}\left\{Y\in \mathcal{C}(X)\right\} \leq 1-\alpha + O((n\epsilon)^{-1}).$$

The gap between the lower and upper bounds on the probability of coverage is roughly of order $O((n\epsilon)^{-1})$, similar to the standard gap $O(n^{-1})$ without the privacy requirement. With this, we provide the first theoretical insight into the cost of privacy in conformal prediction. To shed further light on the properties of our procedure, we perform an extensive empirical study in which we evaluate the tradeoff between the level of privacy on one hand, and the coverage and size of prediction sets on the other.

Differential privacy (Dwork et al., 2006) has become the de facto standard for privacy-preserving data analysis, as witnessed by its widespread adoption in large-scale systems such as those by Google (Bittau et al., 2017; Erlingsson et al., 2014), Apple (2017), Microsoft (Ding et al., 2017), and the US Census Bureau (Abowd, 2018; Dwork, 2019). This increasing adoption of differential privacy goes hand in hand with steady progress in differentially private model training, ranging across both convex (Bassily et al., 2014; Chaudhuri et al., 2011) and nonconvex (Abadi et al., 2016; Neel et al., 2020) settings. Our work complements these works by proposing a procedure that can be combined with any differentially private model training algorithm to account for the uncertainty of the resulting predictive model by producing a prediction set function with formal guarantees. At a technical level, closest to our algorithm on the privacy side are existing methods for reporting histograms and quantiles in a privacy-preserving fashion (Dwork et al., 2006; Feldman & Steinke, 2017; J. Lei, 2011; Smith, 2011; Xu et al., 2013). Finally,
there have also been significant efforts to quantify uncertainty with formal privacy guarantees through various types of private confidence intervals (Gaboardi et al., 2019; Karwa & Vadhan, 2017; Sheffet, 2017; Wang et al., 2019). While prediction sets resemble confidence intervals, they are fundamentally different objects as they do not aim to cover a fixed parameter of the population distribution, but rather a randomly sampled outcome. As a result, existing methods for differentially private confidence intervals do not generalize to our problem setting.

Prediction sets as a way to represent uncertainty are a classical idea, going back at least to tolerance regions in the 1940s (Tukey, 1947; Wald, 1943; Wilks, 1941, 1942). See Krishnamoorthy & Mathew (2009) for an overview of tolerance regions and Park et al. (2020) for a recent application to deep learning models. Conformal prediction (Shafer & Vovk, 2008; Vovk et al., 1999, 2005) is a related way of producing predictive sets with finite-sample guarantees. Most relevant to the present work, split conformal prediction (J. Lei et al., 2015, 2018; Papadopoulos et al., 2002) is a convenient version that uses data splitting to give prediction sets in a computationally efficient way. Vovk (2015) and Barber et al. (2021) refine this approach to reuse data for both training and calibration, improving statistical efficiency. Recent work has targeted desiderata such as small set sizes (Angelopoulos et al., 2020; Sadinle et al., 2019), coverage that is approximately balanced across feature space (Cauchois, Gupta, & Duchi, 2020; Barber et al., 2019; Guan, 2020; Izbicki et al., 2019; Romano et al., 2019; Romano et al., 2020; Vovk, 2012), and coverage that is balanced across classes (Guan & Tibshirani, 2019; Hechtlinger et al., 2018; J. Lei, 2014; Sadinle et al., 2019). Further extensions address problems in distribution estimation (Vovk et al., 2017, 2020), handling or testing distribution shift (Cauchois, Gupta, Ali, & Duchi, 2020; Hu & Lei, 2020; Tibshirani et al., 2019), causal inference (L. Lei & Candès, 2020), and controlling other notions of statistical error (Bates et al., 2021). We suggest (Angelopoulos & Bates, 2021) and (Shafer & Vovk, 2008) as introductory tutorials on conformal prediction for the unfamiliar reader. Lastly, we highlight two alternative approaches with a similar goal to conformal prediction. First, the calibration technique in (Jung et al., 2020) and (Gupta et al., 2021) generates prediction sets via the estimation of higher moments across many overlapping subpopulations. Second, there is a family of techniques that define a utility function balancing set-size and coverage and then search for set-valued predictors to maximize this utility (del Coz et al., 2009; Grycko, 1993; Mortier et al., 2020). The present work builds on split conformal prediction, but modifies the calibration step to preserve privacy.

2. Preliminaries

In this section, we formally introduce the main concepts in our problem setting. Split conformal prediction assumes access to a predictive model, $\hat{f}$, and aims to output prediction sets that achieve coverage by quantifying the uncertainty of $\hat{f}$ and the intrinsic randomness in $X$ and $Y$. It quantifies this uncertainty using a calibration data set consisting of $n$ i.i.d. samples, $\{(X_i,Y_i)\}_{i=1}^n$, that were not used to train $\hat{f}$. The calibration proceeds by defining a score function $S_{\hat{f}}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Without loss of generality we take the range of this function to be the unit interval $[0, 1]$. The reader should think of the score as measuring the degree of consistency of the response $Y$ with the features $X$ based on the predictive model $\hat{f}$ (e.g., the size of the residual in a regression model), but any score function would lead to correct coverage. To simplify notation we will write $S(\cdot,\cdot)$ to denote the score, where we implicitly assume an underlying model $\hat{f}$. From this score function, one forms prediction sets as follows:

$$\mathcal{C}(x) = \{y : S(x, y) \le \hat{s}\}, \qquad (2)$$

for a choice of $\hat{s}$ based on the calibration data set. In particular, $\hat{s}$ is taken to be a quantile of the calibration scores $s_i = S(X_i, Y_i)$ for $i=1,\dots,n$. In nonprivate conformal prediction, one simply takes $\hat{s}$ to be the $\lceil (n+1)(1-\alpha)\rceil / n$ quantile, and then a standard argument shows that the coverage property in (1) holds. In this work we show how to take a modified private quantile that maintains this coverage guarantee.

As a concrete example of standard split conformal prediction, consider classifying an image in $\mathcal{X}=\mathbb{R}^{m\times d}$ into one of a thousand classes, $\mathcal{Y}=\{1,...,1000\}$. Given a standard classifier outputting a probability distribution over the classes, $\hat{f}:\mathcal{X} \to [0,1]^{1000}$ (e.g., the output of a softmax layer), we can define a natural score function based on the activation of the correct class, $S(x,y)=1-\hat f(x)_y$. Then we take $\hat{s}$ as the upper $\lceil 0.9 (n+1)\rceil / n$ quantile of the calibration scores $s_1,\dots, s_n$ and define $\mathcal{C}$ as in Equation (2). That is, we take as the cutoff $\hat{s}$ the value such that, if we include all classes with estimated probability greater than $1 - \hat{s}$, our sets have (only slightly more than) 90% coverage on the calibration data. The result $\mathcal{C}(x)$ on a test point is then a set of plausible classes guaranteed to contain the true class with probability 90%. Our proposed method will follow a similar workflow, but with a slightly different choice of $\hat{s}$ to guarantee both coverage and privacy.
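To make this workflow concrete, the following minimal sketch implements the nonprivate split conformal baseline for a classifier. The function names (`split_conformal_threshold`, `prediction_set`) and the NumPy-only implementation (assuming a NumPy version that supports the `method` argument of `np.quantile`) are our own illustrative choices, not the paper's released code.

```python
import numpy as np

def split_conformal_threshold(softmax_probs, labels, alpha=0.1):
    """Nonprivate split conformal calibration for classification (sketch).

    softmax_probs: (n, K) array of predicted class probabilities on the calibration set.
    labels: (n,) integer array of true labels.
    Returns the cutoff s_hat defining C(x) = {y : 1 - f_hat(x)_y <= s_hat}.
    """
    n = len(labels)
    scores = 1.0 - softmax_probs[np.arange(n), labels]    # s_i = 1 - f_hat(X_i)_{Y_i}
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # the ceil((n+1)(1-alpha))/n level
    return np.quantile(scores, level, method="higher")    # conservative empirical quantile

def prediction_set(softmax_row, s_hat):
    """All labels whose conformity score 1 - f_hat(x)_y falls below the calibrated cutoff."""
    return np.where(1.0 - softmax_row <= s_hat)[0]
```

On a new test point, `prediction_set(f_hat(x), s_hat)` then returns the label set in Equation (2); the private procedure described next replaces only the quantile computation.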

We next formally define differential privacy. We say that two data sets $\mathcal{D},\mathcal{D}'$ are neighboring if they differ in a single element, that is, either data set can be obtained from the other by removing a single entry. For example, $\mathcal{D}\in(\mathcal{X}\times\mathcal{Y})^n$ and $\mathcal{D}' = \mathcal{D}\setminus \{(X_0,Y_0)\}$, for some $(X_0,Y_0)\in \mathcal{D}$. Differential privacy then requires that two neighboring data sets produce similar distributions on the output.

Definition 1 (Differential privacy; Dwork et al., 2006). A randomized algorithm $\mathcal{A}$ is $\epsilon$-differentially private if for all neighboring data sets $\mathcal{D}$ and $\mathcal{D}'$, it holds that:

$$\mathbb{P}\left\{\mathcal{A}(\mathcal{D}) \in \mathcal{O}\right\} \leq e^\epsilon\, \mathbb{P}\left\{\mathcal{A}(\mathcal{D}') \in \mathcal{O}\right\}, \quad \text{for all measurable sets } \mathcal{O}.$$

In short, if no adversary observing the algorithm’s output can distinguish between $\mathcal{D}$ and a data set $\mathcal{D}'$ with the $i$-th entry removed, the presence of individual $i$ in the analysis cannot be detected and hence their privacy is not compromised.

A key ingredient to our procedure is a privatized quantile of the conformity scores. We obtain this private quantile by discretizing the scores into bins and applying the exponential mechanism (McSherry & Talwar, 2007), one of the most ubiquitous tools in differential privacy. Our private quantile routine is then an extension of the private median routine proposed by Feldman and Steinke (2017) to handle arbitrary quantiles. Specifically, let us fix a number of bins $m\in\mathbb{N}$, as well as edges $0 \equiv e_0 < e_1 < \dots < e_{m-1} < e_m \equiv 1$. The edges define the bins $I_j = (e_{j-1}, e_j]$, $j=1, \dots, m$. We use Algorithm 2 with appropriately chosen quantile level $q$ as a subroutine of our main conformal procedure.

Algorithm 2. Differentially private quantile

input: calibration scores $\{s_1,\dots,s_{n}\}$, bins $\{I_1,\dots, I_m\}$, privacy level $\epsilon$, level $q\in[0,1]$

For all $1\leq i\leq n$, compute discretized score $[s_i] = e_j$, where $s_i \in I_j$

For all $1\leq j\leq m$, compute $w_j = \max\left\{\frac{\#\{i: [s_i] < e_j\}}{q},\frac{\#\{i: [s_i] > e_j\}}{1-q}\right\}$

Let $\hat s = e_j$ with probability $e^{-\frac{\epsilon}{2}w_j}\Big/\sum_{j'=1}^m e^{-\frac{\epsilon}{2}w_{j'}}$

output: private quantile $\hat s$
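As a companion to the pseudocode, here is a short NumPy sketch of Algorithm 2. The function name `private_quantile` and the particular vectorization are our own; the bin edges are assumed to be the upper edges $e_1 < \dots < e_m$ of the bins $I_j$ (with $e_0 = 0$ implicit).

```python
import numpy as np

def private_quantile(scores, edges, epsilon, q, rng=None):
    """Sketch of Algorithm 2: an epsilon-DP quantile via the exponential mechanism.

    scores: conformity scores in [0, 1]; edges: upper bin edges e_1 < ... < e_m.
    Returns one edge, sampled with probability proportional to exp(-epsilon * w_j / 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    scores, edges = np.asarray(scores), np.asarray(edges)
    # Discretize: [s_i] = e_j for the bin I_j = (e_{j-1}, e_j] containing s_i.
    disc = edges[np.minimum(np.searchsorted(edges, scores, side="left"), len(edges) - 1)]
    # Utility w_j from Algorithm 2 for each candidate edge e_j.
    w = np.array([max(np.sum(disc < e) / q, np.sum(disc > e) / (1.0 - q)) for e in edges])
    # Exponential mechanism (subtracting the minimum utility for numerical stability).
    weights = np.exp(-0.5 * epsilon * (w - w.min()))
    return rng.choice(edges, p=weights / weights.sum())
```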

3. Algorithm and Guarantees

We next precisely state our main algorithm and its formal guarantees. First, our algorithm has a calibration step, Algorithm 3, carried out one time using the calibration scores $s_1,\dots,s_n$ as input; this is the heart of our proposed procedure. The output of this step is a cutoff $\hat{s}$ learned from the calibration data. With this in hand, one forms the prediction set for a test point $x$ as in Equation (2), which for completeness we state in Algorithm 4.

Algorithm 3. Differentially private calibration

input: calibration scores $\{s_1,\dots,s_{n}\}$, privacy parameter $\epsilon$, coverage level $\alpha$, bins $\{I_1,\dots,I_m\}$

Compute the $\tilde q$-quantile of $\{s_1,\dots,s_n\}$ via Algorithm 2, where $\tilde q$ is defined in (3), denoted $\hat s$

output: calibrated score cutoff $\hat{s}$

Algorithm 4. Differentially private prediction set

input: test point $x$, calibrated score cutoff $\hat{s}$

output: prediction set as in (2): $\mathcal{C}(x) = \{y : S(x,y) \le \hat{s} \}$.
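The two steps can be sketched in a few lines of code. This sketch assumes the `private_quantile` function from the Algorithm 2 sketch above is in scope, and takes as input the inflated level $\tilde q$ defined in Equation (3) below; the handling of the degenerate case $\tilde q \ge 1$ is our own implementation choice.

```python
import numpy as np

def private_calibrate(cal_scores, edges, epsilon, q_tilde, rng=None):
    """Sketch of Algorithm 3: one-time private calibration of the cutoff s_hat.

    Runs the exponential-mechanism quantile (Algorithm 2 sketch above) at level
    min(q_tilde, 1), where q_tilde is the inflated level from Equation (3).
    """
    q = min(q_tilde, 1.0)
    if q >= 1.0:              # degenerate case: the top bin edge is always a valid cutoff
        return edges[-1]
    return private_quantile(cal_scores, edges, epsilon, q, rng=rng)

def private_prediction_set(softmax_row, s_hat):
    """Sketch of Algorithm 4: C(x) = {y : 1 - f_hat(x)_y <= s_hat} for one test point."""
    return np.where(1.0 - softmax_row <= s_hat)[0]
```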

This algorithm both satisfies differential privacy and guarantees correct coverage, as stated next in Proposition 1 and Theorem 2, respectively. The privacy property is a straightforward consequence of the privacy guarantees on the exponential mechanism (McSherry & Talwar, 2007).

Proposition 1 (Privacy guarantee). Algorithm 3 is $\epsilon$-differentially private.

Therefore, the main challenge for theory lies in understanding how to compensate for the added differentially private noise in order to get strict, distribution-free coverage guarantees.

Theorem 2 (Coverage guarantee). Fix the differential privacy level $\epsilon>0$ and miscoverage level $\alpha\in (0, 0.5)$, as well as a free parameter $\gamma\in(0,1)$. Let

$$\tilde q = \frac{(n+1)(1-\alpha)}{n(1-\gamma\alpha)} + \frac{2}{\epsilon n}\log\left(\frac{m}{\gamma\alpha}\right), \qquad (3)$$

and let $\hat s$ be the output of Algorithm 2 at level $\min\{\tilde q,1\}$. Then, the prediction sets in (2) with cutoff $\hat s$ satisfy the coverage property in (1).

Remark 1. We can choose $\gamma$ to minimize $\tilde q$, which leads to the smallest prediction sets. The optimal value $\gamma^*$ depends only on $n, m,$ and $\alpha$, and can be found by taking a derivative of (3); see Appendix C.

Note that the significance level $\tilde{q}$ in (3) is just a slightly inflated version of the nonprivate conformal quantile: $\tilde{q} \ge \frac{(n+1)(1-\alpha)}{n} \ge 1 - \alpha$. Indeed, taking $\epsilon \to \infty$ and $\gamma \to 0$ in (3) recovers the nonprivate quantile. Intuitively, we must raise the significance level to compensate for the noise introduced to preserve privacy. We note that the additive factor of order $\frac{1}{n\epsilon}$ is in fact necessary to compute an approximate quantile with $\epsilon$-differential privacy (Bun et al., 2017).
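As a quick numerical illustration (with made-up values of $n$, $\epsilon$, $m$, $\alpha$, and $\gamma$, not taken from our experiments), the following snippet compares $\tilde q$ from Equation (3) to the nonprivate conformal level:

```python
import numpy as np

# Illustrative values only: n calibration points, privacy level eps, m bins.
n, eps, m, alpha, gamma = 5000, 1.0, 5000, 0.1, 0.01

q_tilde = (n + 1) * (1 - alpha) / (n * (1 - gamma * alpha)) \
          + 2 / (eps * n) * np.log(m / (gamma * alpha))      # Equation (3)
q_nonprivate = np.ceil((n + 1) * (1 - alpha)) / n             # nonprivate conformal level

print(f"private level q_tilde = {q_tilde:.4f}, nonprivate level = {q_nonprivate:.4f}")
```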

We informally sketch the main ideas in the proof, deferring the details to the Appendix.

Proof sketch. We can write the probability of coverage as:

$$\mathbb{P}\left\{Y \in \mathcal{C}(X)\right\} = \mathbb{E}\left[F(\hat s)\right],$$

where $F$ is the distribution of the appropriately discretized scores. We observe that for any $\tilde q$, the exponential mechanism with input $\tilde q$ and $s_1,\dots,s_n$ returns an empirical quantile no smaller than the $\tilde q - O(1/(n\epsilon))$ empirical quantile. This allows us to write

$$\mathbb{E}\left[F(\hat s)\right] \geq (1-\gamma\alpha)\, \mathbb{E}\left[F\left(\hat {F}^{-1}\left(\tilde q - O(1/(n\epsilon))\right)\right)\right],$$

where $\hat F$ denotes the empirical distribution of the discretized scores. For any $q$, the random variable $F(\hat {F}^{-1}(q))$ is distributed as the $\lceil n q\rceil$-th order statistic of a super-uniform distribution, which implies that it can be stochastically lower bounded by the $\lceil n q\rceil$-th order statistic of a uniform distribution. This order statistic follows a beta distribution with known parameters, whose expectation can hence be evaluated analytically. Carefully choosing $\tilde q$ as a function of this expectation completes the proof of the theorem.
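For readers unfamiliar with this step, the expectation in question is simply the mean of a beta-distributed order statistic; writing it out (a restatement of the step above, not an additional assumption),

$$\mathbb{E}\left[F\big(\hat F^{-1}(q)\big)\right] \;\ge\; \mathbb{E}\big[\mathrm{Beta}\big(\lceil nq\rceil,\; n-\lceil nq\rceil+1\big)\big] \;=\; \frac{\lceil nq\rceil}{n+1} \;\ge\; \frac{nq}{n+1},$$

using $\mathbb{E}[\mathrm{Beta}(a,b)] = a/(a+b)$; choosing $\tilde q$ so that this lower bound is at least $(1-\alpha)/(1-\gamma\alpha)$ yields the expression in (3).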

Figure 3. The private quantile as n and ϵ grow. We demonstrate the adjusted quantile from (3) as n and ϵ increase, with automatically chosen values for m and γ described in Appendix C. As the number of samples grows and the privacy constraint relaxes, the procedure chooses a less conservative quantile, eventually approaching the limiting value 1 − α. The mild fluctuations in the curves are due to differing choices of m* and γ.

With the validity of Algorithm 3 established, we next prove that the algorithm is not too conservative in the sense that the coverage is not far above $1-\alpha$. A key quantity in our upper bound is

$$p_{\max}^m := \max_{1\leq j\leq m} \mathbb{P}\left\{s_1 \in I_j\right\}.$$

This quantity captures the impact of the score discretization. Smaller $p_{\max}^m$ corresponds to mass being spread more evenly throughout the bins. For well-behaved score functions, we expect $p_{\max}^m$ to scale as $O(m^{-1})$. Indeed, if the scores have any continuous density on $[0,1]$ bounded above and we take uniformly spaced bins, then $p_{\max}^m = O(m^{-1})$. In terms of $p_{\max}^m$, we have the following upper bound.

Theorem 3 (Coverage upper bound). The prediction sets in (2), with $\hat s$ as in Theorem 2, satisfy the following coverage upper bound:

$$\mathbb{P}\left\{Y \in \mathcal{C}(X)\right\} \leq 1 - (1 - \gamma)\alpha + (1-\gamma\alpha)\left( 2p_{\max}^m + \frac{2\left(1 +\max\left\{\frac{\tilde q}{1-\tilde q},1\right\}\right)\log(m/(\gamma\alpha))}{(n+1) \epsilon }\right),$$

where $\tilde q$ is defined in (3).

If we further assume a weak regularity condition on the scores, then by balancing the rates in the expression above we arrive at an explicit upper bound.

Corollary 1 (Coverage upper bound, simplified form). Suppose that the input scores follow a continuous distribution on $[0,1]$ with a density that is bounded above. Take $m \propto n\epsilon$ and $\gamma = 1/m$. Then, the prediction sets in (2), with $\hat s$ as in Theorem 2, satisfy the following upper bound:

$$\mathbb{P}\left\{Y \in \mathcal{C}(X)\right\} \leq 1 - \alpha + O\left(\frac{\log\left(n\epsilon/\alpha\right)}{n\epsilon}\right).$$

We emphasize that the assumptions on the score distribution are only needed to prove the upper bound; the coverage lower bound holds for any distribution. In any case, these assumptions are very weak, essentially requiring only that the score distribution contains no point masses. In fact, this requirement could even be enforced ex post facto by adding a small amount of tiebreaking noise, in which case we would need no restrictions on the input distribution of scores whatsoever.

The upper bound answers an important practical question: how many bins should we take? If $m$ is too small, then the histogram only coarsely approximates the empirical distribution of the scores. On the other hand, if $m$ is too large, then the histogram is accurate, but the private quantile in (3) can grow as well. This tension can be observed in the terms in Theorem 3 that have a dependence on $m$. Corollary 1 suggests that the correct balance—which leads to minimal excess coverage—is to take $m \propto n\epsilon$. In practice, because the dependence of $\tilde{q}$ on $m$ is only logarithmic, $m$ is often very large.

This upper bound also gives insight into an important theoretical question: what is the cost of privacy in conformal prediction? In nonprivate conformal prediction, the upper bound is $1-\alpha + O(n^{-1})$ (J. Lei et al., 2018). In private conformal prediction, we achieve an upper bound of $1-\alpha + \tilde O((n\epsilon)^{-1})$, a relatively modest cost incurred by privacy-preserving calibration.

4. Experiments

We now turn to an empirical evaluation of differentially private conformal prediction for image classification problems. In this setting, each image $X_i$ has a single unique class label $Y_i \in \{1,...,K\}$ estimated by a predictive model $\hat{f} : \mathcal{X} \to [0,1]^K$. We seek to create private prediction sets, $\mathcal{C}(X_i) \subseteq \{1,...,K\}$, achieving coverage as in Equation (1), using the following score function:

$$S(x,y) = 1 - \hat{f}(x)_y,$$

as in Sadinle et al. (2019). This section evaluates the prediction sets generated by Algorithm 3 by quantifying the cost of privacy and the effects of the model, the number of calibration points, and the number of bins used in our procedure. We use the CIFAR-10 data set (Krizhevsky & Hinton, 2009) wherever we require a privately trained neural network. Otherwise, we use a nonprivate model on the ImageNet data set (Deng et al., 2009), to investigate the performance of our procedure in a more challenging setting with a large number of possible labels. Except where otherwise mentioned, we use an automatically chosen number of uniformly spaced bins $m^*$ to construct the privatized CDF. Appendix C describes the algorithm for choosing an approximately optimal value of $m^*$ when the conformal scores are roughly uniform, based on fixed values of $n$, $\epsilon$, and $\alpha$. We finish the section by providing private prediction sets for diagnosing viral pneumonia on the CoronaHack data set (Perez et al., 2020). The reader can reproduce the experiments exactly using our public GitHub repository.
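To fix ideas, the sketch below shows the kind of evaluation loop used conceptually throughout this section: one random calibration/validation split, a calibration routine passed in as a function, and the resulting empirical coverage and average set size. The names (`evaluate_split`, `s_hat_fn`) and the NumPy-only implementation are illustrative assumptions on our part; the released GitHub code is the reference implementation.

```python
import numpy as np

def evaluate_split(softmax_probs, labels, s_hat_fn, n_cal, rng=None):
    """One random calibration/validation split: returns (coverage, average set size).

    softmax_probs: (N, K) predicted probabilities; labels: (N,) true labels;
    s_hat_fn: maps calibration conformity scores to a cutoff (private or nonprivate).
    """
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(labels))
    cal, val = perm[:n_cal], perm[n_cal:]
    cal_scores = 1.0 - softmax_probs[cal, labels[cal]]    # S(x, y) = 1 - f_hat(x)_y
    s_hat = s_hat_fn(cal_scores)
    member = softmax_probs[val] >= 1.0 - s_hat            # (n_val, K) set membership
    coverage = member[np.arange(len(val)), labels[val]].mean()
    avg_size = member.sum(axis=1).mean()
    return coverage, avg_size
```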

4.1. Isolating the effects of private model training and private conformal prediction

We would like to disentangle the effects of private conformal prediction from those of private model training. To that end, we report the coverage and set sizes of the following four procedures: private conformal prediction with a private model, nonprivate conformal prediction with a private model, private conformal prediction with a nonprivate model, and nonprivate conformal prediction with a nonprivate model. The nonprivate and private models share the same stock convolutional architecture from the Opacus library. The private model is trained with private SGD (Abadi et al., 2016), as implemented in the Opacus library, with privacy parameters $\epsilon=8$ and $\delta=10^{-5}$. We used the suggested private model training parameters from the Opacus library (see Appendix C), as our work does not aim to improve private model training. The nonprivate model’s accuracy ($73\%$) was significantly higher than that of the private model ($67\%$).

Figure 4. Coverage and set size with private/nonprivate models and private/nonprivate conformal prediction. We demonstrate histograms of coverage and set size of nonprivate/private models and nonprivate/private conformal prediction at the level α = 0.1, with ε = 8, δ = 1e-5, and n = 5000.

Figure 4 shows histograms of the coverages and set sizes of these procedures over 1,000 random splits of the CIFAR-10 validation set with $n=5000$. Notably, the results show that the price of private conformal prediction is very low, as evidenced by the minuscule increase in set size caused by private conformal prediction. However, private model training causes a larger set size due to the private model’s comparatively poor performance. Note that a user desiring a fully private pipeline will use the procedure in the bottom right quadrant of the plot.

4.2. Varying number of bins $m$

Here we probe the performance of private prediction sets as the number of uniformly spaced bins $m$ in our procedure changes. Based on our theoretical results, $m$ should be on the order of $n\epsilon$, with the exact number dependent on the underlying model and the choices of $\alpha$, $n$, and $\epsilon$. A too-small choice of $m$ coarsely quantizes the scores, so Algorithm 3 may be forced to round up to a very conservative private quantile. A too-large choice of $m$ increases the logarithmic term in (3). The optimal choice of $m$ balances these two factors.

Figure 5. Coverage and set size for different values of m. We demonstrate the performance on Imagenet of private conformal prediction using a nonprivate ResNet-152 as the base model at α = 0.1 and ϵ = 1. The coverage is nearly constant over three orders of magnitude of bin numbers. All of the histograms on the right hand side are overlapping. See Section 4.2 for details.

To demonstrate this tradeoff, we performed experiments on ImageNet. We used a nonprivate, pretrained ResNet-152 from the torchvision repository as the base model. Figure 5 shows the coverage and set size of private prediction sets over 100 random splits of ImageNet’s validation set for several choices of $m$; we used $n=30{,}000$ and evaluated on the remaining $20{,}000$ images. The experimental results suggest $m^*$ works comparatively well, and that our method is relatively insensitive to the number of bins over several orders of magnitude.

4.3. Varying privacy level $\epsilon$

Figure 6. Coverage and set size for different values of ϵ. We demonstrate the performance on ImageNet of private conformal prediction using a nonprivate ResNet-152 as the base model with α = 0.1. The coverage improves slightly for liberal (large) ϵ, although the cost of privacy is evidently very low. All of the histograms on the right hand side are overlapping. See Section 4.3 for details.

Next we quantify how the coverage changes with the privacy parameter $\epsilon$. We used $n=30{,}000$ calibration points and $20{,}000$ evaluation points as in Experiment 4.2. For each value of $\epsilon$ we choose a different value of $m^*$. Figure 6 shows the coverage and set size of private prediction sets over 100 splits of ImageNet’s validation set for several choices of $\epsilon$. As $\epsilon$ grows, the procedure becomes less conservative. Overall, the procedure exhibits little sensitivity to $\epsilon$.

4.4. COVID-19 diagnosis

Figure 7. Coverage and set size on the CoronaHack data set. We demonstrate the performance on the CoronaHack data set of private conformal prediction using a nonprivate ResNet-18 as the base model with α = 0.1. The median coverage was 90.4%. See Section 4.4 for details.

Next we show results on the CoronaHack data set, a public chest X-ray data set containing $5{,}908$ X-rays labeled as normal, viral pneumonia (primarily COVID-19), or bacterial pneumonia. Using $4{,}408$ training pairs over $14$ epochs, we (nonprivately) fine-tuned the last layer of a pretrained ResNet-18 from torchvision to predict one of the three diagnoses. The private conformal calibration procedure saw a further $n=1{,}000$ examples, and we used the remaining $500$ for validation. The ResNet-18 had a final accuracy of $75\%$ after fine-tuning. Figure 7 plots the coverage and set size of this procedure over $1{,}000$ different train/calibration/validation splits of the data set, and Figure 2 shows selected examples of these sets.

5. Discussion

We introduce a method to produce differentially private prediction sets that contain the true response with a user-specified probability by blending split conformal prediction with differentially private quantile computation. The primary challenge we resolve in this work is simultaneously satisfying the coverage property and the privacy property, which requires a careful choice of the conformal score threshold to account for the added privacy noise. Our corresponding upper bound shows that the coverage does not greatly exceed the nominal level $1-\alpha$, meaning that our procedure is not too conservative. Moreover, our upper bound gives insight into the price of privacy in conformal prediction: the upper bound scales as $\tilde O((n \epsilon)^{-1})$ compared to $O(n^{-1})$ for nonprivate conformal prediction, a mild decrease in efficiency. This is confirmed in our experiments, where we show that there is little difference between private and nonprivate conformal prediction when using the same predictive model. We also observe the familiar phenomenon that there is a substantial decrease in accuracy for private model fitting compared to nonprivate model fitting. We conclude that the cost of privacy lies primarily in the model fitting—private calibration has a comparatively minor effect on performance. We also note that any improvement in private model training would immediately translate to smaller prediction sets returned by our method. In sum, we view private conformal prediction as an appealing method for uncertainty quantification with differentially private models.


Disclosure Statement

Anastasios Nikolas Angelopoulos, Stephen Bates, Tijana Zrnic, and Michael I. Jordan have no financial or non-financial disclosures to share for this article.


References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (pp. 308–318). https://doi.org/10.1145/2976749.2978318

Abowd, J. M. (2018). The U.S. census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (p. 2867). https://doi.org/10.1145/3219819.3226070

Angelopoulos, A. N., & Bates, S. (2021). A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv. https://doi.org/10.48550/arXiv.2107.07511

Angelopoulos, A. N., Bates, S., Malik, J., & Jordan, M. I. (2020). Uncertainty sets for image classifiers using conformal prediction. arXiv. https://doi.org/10.48550/arXiv.2009.14193

Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2019). The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2), 455–482. https://doi.org/10.1093/imaiai/iaaa017

Barber, R. F., Candès, E. J., Ramdas, A., Tibshirani, R. J. (2021). Predictive inference with the jackknife+. Annals of Statistics, 49(1), 486–507. https://doi.org/10.1214/20-AOS1965

Bassily, R., Smith, A., & Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th annual symposium on foundations of computer science (pp. 464–473). https://doi.org/10.1109/FOCS.2014.56

Bates, S., Angelopoulos, A., Lei, L., Malik, J., & Jordan, M. I. (2021). Distribution-free, risk-controlling prediction sets. arXiv. https://doi.org/10.48550/arXiv.2101.02703

Bittau, A., Erlingsson, Ú., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., ... Seefeld, B. (2017). Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th symposium on operating systems principles (pp. 441–459). https://doi.org/10.1145/3132747.3132769

Bun, M., Steinke, T., & Ullman, J. (2017). Make up your mind: The price of online queries in differential privacy. In Proceedings of the 28th annual ACM-SIAM symposium on discrete algorithms (pp. 1306–1325).

Cauchois, M., Gupta, S., Ali, A., & Duchi, J. C. (2020). Robust validation: Confident predictions even when distributions shift. arXiv. https://doi.org/10.48550/arXiv.2008.04267

Cauchois, M., Gupta, S., & Duchi, J. (2020). Knowing what you know: Valid and validated confidence sets in multiclass and multilabel prediction. arXiv. https://doi.org/10.48550/arXiv.2004.10181

Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(29), 1069–1109. https://jmlr.org/papers/v12/chaudhuri11a.html

del Coz, J. J., Díez, J., & Bahamonde, A. (2009). Learning nondeterministic classifiers. Journal of Machine Learning Research, 10(79), 2273–2293. http://jmlr.org/papers/v10/delcoz09a.html

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). https://doi.org/10.1109/CVPR.2009.5206848

Differential Privacy Team Apple. (2017). Learning with privacy at scale. Apple machine learning research. https://machinelearning.apple.com/research/learning-with-privacy-at-scale

Ding, B., Kulkarni, J., & Yekhanin, S. (2017). Collecting telemetry data privately. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017) (pp. 3571–3580). https://papers.nips.cc/paper/2017/hash/253614bbac999b38b5b60cae531c4969-Abstract.html

Dwork, C. (2019). Differential privacy and the US census. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems (p. 1). https://doi.org/10.1145/3294052.3322188

Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In S. Halevi, & T. Rabin (Eds.), Theory of cryptography: Third theory of cryptography conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings (pp. 265–284). https://doi.org/10.1007/11681878_14

Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407. https://doi.org/10.1561/0400000042

Erlingsson, Ú., Pihur, V., & Korolova, A. (2014). Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security (pp. 1054–1067). https://doi.org/10.1145/2660267.2660348

Feldman, V., & Steinke, T. (2017). Generalization for adaptively-chosen estimators via stable median. In Proceedings of Machine Learning Research: Vol. 65. Proceedings of the 2017 Conference on Learning Theory (pp. 728–757). https://proceedings.mlr.press/v65/feldman17a.html

Gaboardi, M., Rogers, R., & Sheffet, O. (2019). Locally private mean estimation: Z-test and tight confidence intervals. In Proceedings of Machine Learning Research: Vol. 89. The 22nd international conference on artificial intelligence and statistics (pp. 2545–2554). http://proceedings.mlr.press/v89/gaboardi19a.html

Grycko, E. (1993). Classification with set-valued decision functions. In O. Opitz, B. Lausen, & R. Klar (Eds.), Information and classification: Concepts, methods and applications (pp. 218–224). https://doi.org/10.1007/978-3-642-50974-2_22

Guan, L. (2020). Conformal prediction with localization. arXiv. https://doi.org/10.48550/arXiv.1908.08558

Guan, L., & Tibshirani, R. (2019). Prediction and outlier detection in classification problems. arXiv. https://doi.org/10.48550/arXiv.1905.04396

Gupta, V., Jung, C., Noarov, G., Pai, M. M., & Roth, A. (2021). Online multivalid learning: Means, moments, and prediction intervals. arXiv. https://doi.org/10.48550/arXiv.2101.01739

Hechtlinger, Y., Póczos, B., & Wasserman, L. (2018). Cautious deep learning. arXiv. https://doi.org/10.48550/arXiv.1805.09460

Hu, X., & Lei, J. (2020). A distribution-free test of covariate shift using conformal prediction. arXiv. https://doi.org/10.48550/arXiv.2010.07147

Izbicki, R., Shimizu, G. T., & Stern, R. B. (2019). Flexible distribution-free conditional predictive bands using density estimators. arXiv. https://doi.org/10.48550/arXiv.1910.05575

Jung, C., Lee, C., Pai, M. M., Roth, A., & Vohra, R. (2020). Moment multicalibration for uncertainty estimation. arXiv. https://doi.org/10.48550/arXiv.2008.08037

Karwa, V., & Vadhan, S. (2017). Finite sample differentially private confidence intervals. arXiv. https://doi.org/10.48550/arXiv.1711.03908

Krishnamoorthy, K., & Mathew, T. (2009). Statistical tolerance regions: Theory, applications, and computation. Wiley.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images [Technical Report]. The University of Toronto. http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Lei, J. (2011). Differentially private m-estimators. In NIPS'11: Proceedings of the 24th International Conference on Neural Information Processing Systems (pp. 361–369). https://papers.nips.cc/paper/2011/hash/f718499c1c8cef6730f9fd03c8125cab-Abstract.html

Lei, J. (2014). Classification with confidence. Biometrika, 101(4), 755–769. https://doi.org/10.1093/biomet/asu038

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111. https://doi.org/10.1080/01621459.2017.1307116

Lei, J., Rinaldo, A., & Wasserman, L. (2015). A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74(1–2), 29–43. https://doi.org/10.1007/s10472-013-9366-6

Lei, L., & Candès, E. J. (2020). Conformal inference of counterfactuals and individual treatment effects. arXiv. https://doi.org/10.48550/arXiv.2006.06138

McSherry, F., & Talwar, K. (2007). Mechanism design via differential privacy. In 48th annual IEEE symposium on foundations of computer science (pp. 94–103). https://doi.org/10.1109/FOCS.2007.66

Mortier, T., Wydmuch, M., Dembczyński, K., Hüllermeier, E., & Waegeman, W. (2020). Efficient set-valued prediction in multi-class classification. arXiv. https://doi.org/10.48550/arXiv.1906.08129

Neel, S., Roth, A., Vietri, G., & Wu, S. (2020). Oracle efficient private non-convex optimization. In Proceedings of Machine Learning Research: Vol. 119. Proceedings of the 37th International Conference on Machine Learning (pp. 7243–7252). http://proceedings.mlr.press/v119/neel20a.html

Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). Inductive confidence machines for regression. In Proceedings of the 13th European Conference on Machine Learning (pp. 345–356). https://doi.org/10.1007/3-540-36755-1_29

Park, S., Bastani, O., Matni, N., & Lee, I. (2020). PAC confidence sets for deep neural networks via calibrated prediction [paper presentation]. The 2020 International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=BJxVI04YvB

Pérez, J. C., de Blas Pérez, C., Alvarez, F. L., & Contreras, J. M. C. (2020). Databiology Lab CORONAHACK: Collection of public COVID-19 data. bioRxiv. https://doi.org/10.1101/2020.10.22.328864

Romano, Y., Patterson, E., & Candès, E. (2019). Conformalized quantile regression. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 3543–3553). https://papers.nips.cc/paper/2019/hash/5103c3584b063c431bd1268e9b5e76fb-Abstract.html

Romano, Y., Sesia, M., & Candès, E. J. (2020). Classification with valid and adaptive coverage. arXiv. https://doi.org/10.48550/arXiv.2006.02544

Sadinle, M., Lei, J., & Wasserman, L. (2019). Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525), 223–234. https://doi.org/10.1080/01621459.2017.1395341

Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9(12), 371–421. https://jmlr.org/papers/v9/shafer08a.html

Sheffet, O. (2017). Differentially private ordinary least squares. In Proceedings of Machine Learning Research: Vol. 70. Proceedings of the 34th International Conference on Machine Learning (pp. 3105–3114). https://proceedings.mlr.press/v70/sheffet17a.html

Smith, A. (2011). Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the forty-third annual ACM symposium on theory of computing (pp. 813–822). https://doi.org/10.1145/1993636.1993743

Tibshirani, R. J., Barber, R. F., Candès, E., & Ramdas, A. (2019). Conformal prediction under covariate shift. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 2530–2540). https://papers.nips.cc/paper/2019/hash/8fb21ee7a2207526da55a679f0332de2-Abstract.html

Tukey, J. W. (1947). Non-parametric estimation II. statistically equivalent blocks and tolerance regions—the continuous case. Annals of Mathematical Statistics, 18(4), 529–539. https://doi.org/10.1214/aoms/1177730343

Vovk, V. (2012). Conditional validity of inductive conformal predictors. In Proceedings of Machine Learning Research: Vol. 25. Proceedings of the Asian conference on machine learning (pp. 475–490). https://proceedings.mlr.press/v25/vovk12.html

Vovk, V. (2015). Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1–2), 9–28. https://doi.org/10.1007/s10472-013-9368-4

Vovk, V., Gammerman, A., & Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning (pp. 444–453).

Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. https://doi.org/10.1007/b106715

Vovk, V., Petej, I., Toccaceli, P., Gammerman, A., Ahlberg, E., & Carlsson, L. (2020). Conformal calibrators. In Proceedings of Machine Learning Research: Vol. 128. Proceedings of the Ninth Symposium on Conformal and Probabilistic Prediction and Applications (pp. 84–99). https://proceedings.mlr.press/v128/vovk20a.html

Vovk, V., Shen, J., Manokhin, V., & Xie, M.-g. (2017). Nonparametric predictive distributions based on conformal prediction. Machine Learning, 108(3), 445–474. https://doi.org/10.1007/s10994-018-5755-8

Wald, A. (1943). An extension of Wilks’ method for setting tolerance limits. Annals of Mathematical Statistics, 14(1), 45–55. https://doi.org/10.1214/aoms/1177731491

Wang, Y., Kifer, D., & Lee, J. (2019). Differentially private confidence intervals for empirical risk minimization. Journal of Privacy and Confidentiality, 9(1). https://doi.org/10.29012/jpc.660

Wilks, S. S. (1941). Determination of sample sizes for setting tolerance limits. Annals of Mathematical Statistics, 12(1), 91–96. https://doi.org/10.1214/aoms/1177731788

Wilks, S. S. (1942). Statistical prediction with special reference to the problem of tolerance limits. Annals of Mathematical Statistics, 13(4), 400–409. https://doi.org/10.1214/aoms/1177731537

Xu, J., Zhang, Z., Xiao, X., Yang, Y., Yu, G., & Winslett, M. (2013). Differentially private histogram publication. The VLDB Journal, 22(6), 797–822. https://doi.org/10.1007/s00778-013-0309-y


Appendices

Appendix A: Auxiliary results

We start with a result about the error of the private quantile mechanism, stated in Algorithm 2. The following is an extension of the analogous result for the private median due to Feldman and Steinke (2017).

Lemma 1. For any $\delta\in(0,1)$, the differentially private quantile algorithm (Algorithm 2) satisfies:

$$\mathbb{P}\left\{ \frac{1}{n}\#\{i:[s_i]\leq \hat s\} \geq q - \frac{2\max\left\{\frac{1-q}{q},1\right\}\log(m/\delta)}{n\epsilon} \right\}\geq 1-\delta$$

and

$$\mathbb{P}\left\{ \frac{1}{n}\#\{i:[s_i]< \hat s\} \leq q + \frac{2\max\left\{\frac{q}{1-q},1\right\}\log(m/\delta)}{n\epsilon} \right\}\geq 1-\delta.$$

Proof. By the standard utility guarantee for the exponential mechanism (McSherry & Talwar, 2007; see, e.g., Corollary 3.12 in Dwork & Roth, 2014), we have

$$\mathbb{P}\left\{\max\left\{\frac{\#\{j:[s_j] < \hat s\}}{q},\frac{\#\{j:[s_j] > \hat s\}}{1-q}\right\} < \min_i w_i + \frac{2\max\{1/q,1/(1-q)\}\log(m/\delta)}{\epsilon}\right\} \geq 1-\delta. \qquad (4)$$

First we argue that $\min_i w_i \leq n$. Let $s^* = \min\{s\in\{e_0,\dots,e_m\}: \#\{i: [s_i] \leq s\}>qn\}$. Then, $\#\{i: [s_i] > s^*\} < (1-q)n$ trivially. Furthermore, $\#\{i: [s_i] < s^*\} \leq q n$ by the definition of $s^*$, since $s^*$ is the first point at which the cumulative fraction of scores less than or equal to $s^*$ exceeds $q$. Since we have identified a bin where $\max\left\{\frac{\#\{j:[s_j] < s^*\}}{q},\frac{\#\{j:[s_j] > s^*\}}{1-q}\right\} \leq n$, we can conclude that $\min_i w_i \leq n$.

Going back to Equation 4, we have that with probability at least 1δ1-\delta,

$$\begin{aligned} \frac{1}{n}\#\{i:[s_i]\leq \hat s\} &= 1 - \frac{1}{n}\#\{i: [s_i] > \hat s\}\\ &\geq 1 - (1-q)\, \frac{\min_i w_i}{n} - \frac{2\max\{(1-q)/q,1\}\log(m/\delta)}{n\epsilon}\\ &\geq q - \frac{2\max\{(1-q)/q,1\}\log(m/\delta)}{n\epsilon}.\end{aligned}$$

Similarly,

$$\begin{aligned} \frac{1}{n}\#\{i:[s_i]< \hat s\} &\leq q\,\frac{\min_i w_i}{n} + \frac{2\max\{1,q/(1-q)\}\log(m/\delta)}{n\epsilon}\\ &\leq q + \frac{2\max\{1,q/(1-q)\}\log(m/\delta)}{n\epsilon}.\end{aligned}$$

 Next, we package some classical facts about the distribution of order statistics in a form helpful for analyzing conformal prediction.

Lemma 2. Let $F$ be the CDF of a distribution supported on a finite set $\{a_1,\dots,a_m\}$. Let $Z_1,\dots,Z_n \stackrel{\text{i.i.d.}}{\sim} F$, and let $\hat F$ denote the empirical CDF corresponding to $Z_1,\dots,Z_n$. Denote also $p_{\max}^m = \max_{1\leq i\leq m}\mathbb{P}\left\{Z_1 = a_i\right\}$. Then,

$$Z_{\mathrm{Beta}} + p_{\max}^m \succeq F(\hat{F}^{-1}(q)) \succeq Z_{\mathrm{Beta}},$$

where $Z_{\mathrm{Beta}}$ follows the beta distribution $\mathrm{Beta}(\lceil nq\rceil, n - \lceil nq\rceil + 1)$ and $\succeq$ denotes first-order stochastic dominance.

Proof. Since we take $\hat F^{-1}(q) = \inf\{z: \hat {F}(z) \ge q\}$ by definition, it follows that $\hat F^{-1}(q) = Z_{(\lceil nq\rceil)}$, where $Z_{(i)}$ denotes the $i$-th nondecreasing order statistic of $Z_1,\dots,Z_n$. By monotonicity of $F$, we further have that $F(Z_{(\lceil nq\rceil)})$ is identical to the $\lceil nq\rceil$-th nondecreasing order statistic of $F(Z_1),\dots,F(Z_n)$. By a standard argument, the samples $F(Z_1),\dots,F(Z_n)$ are super-uniform, that is, $\mathbb{P}\left\{F(Z_1)\leq u\right\}\leq u$ for all $u\in[0,1]$. In other words, they are stochastically larger than a uniform distribution on $[0,1]$, and thus their $\lceil nq\rceil$-th order statistic is stochastically lower bounded by the $\lceil nq\rceil$-th order statistic of a uniform distribution, which follows the $\mathrm{Beta}(\lceil nq\rceil, n - \lceil nq\rceil + 1)$ distribution. This completes the proof of the lower bound. For the upper bound, we use the fact that $\mathbb{P}\left\{F(Z_1) \leq u\right\} \geq u - p_{\max}^m$, and so the $F(Z_i)$ are stochastically dominated by $U_i + p_{\max}^m$, where $\{U_i\}_{i=1}^n$ are i.i.d. uniform on $[0,1]$. Their $\lceil nq\rceil$-th order statistic is distributed as $Z_{\mathrm{Beta}} + p_{\max}^m$, which completes the proof.

Appendix B: Proofs

B.1. Proof of Theorem 2

First we introduce some notation. By $F$ we will denote the discretized CDF of the scores; in particular, for any $i\in\{1,\dots,n\}$,

$$F(s) = \mathbb{P}\left\{ [s_i] \leq s\right\}.$$

Here, by $[s_i]$ we denote a discretized version of $s_i$ where we set $[s_i] = e_j$ if $s_i \in I_j$. We also let $\hat F$ denote the empirical distribution of the discretized scores:

$$\hat F(s) =\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{[s_i]\leq s\}.$$

By convention, we let $F^{-1}(\delta)$ denote the left-continuous inverse of $F$, i.e., $F^{-1}(\delta) := \inf \{s: F(s)\geq \delta\}$, and we similarly define $\hat F^{-1}(\delta)$.

We can write

$$\mathbb{P}\left\{Y \in \mathcal{C}(X)\right\} = \mathbb{P}\left\{S(X,Y) \leq \hat s\right\} = \mathbb{E}\left[F(\hat s)\right].$$

Denote the event $E=\left\{ \frac{1}{n}\#\{i:[s_i] \leq \hat s\} \geq \tilde q - \frac{2}{\epsilon n}\log(m/(\gamma\alpha))\right\}$, and note that by Lemma 1 and the fact that $\tilde q \geq 0.5$, $\mathbb{P}\left\{E\right\}\geq 1-\gamma\alpha$. By splitting up the analysis depending on $E$, we obtain the following:

$$\begin{aligned} \mathbb{E}\left[F(\hat s)\right] &= \mathbb{E}\left[F(\hat s)\mathbf{1}\{E^c\}\right] + \mathbb{E}\left[F(\hat s)\mathbf{1}\{E\}\right]\\ &\geq \gamma\alpha\cdot 0 + \mathbb{E}\left[F(\hat s)\mathbf{1}\{E\}\right]\\ &\geq (1-\gamma\alpha)\, \mathbb{E}\left[F\left(\hat{F}^{-1}\left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right)\right)\right) \right],\end{aligned}$$

where the final inequality follows by the definition of EE. Thus, it suffices to show that

$$\mathbb{E}\left[F\left(\hat{F}^{-1}\left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right) \right] \ge \frac{1-\alpha}{1-\gamma\alpha}. \qquad (5)$$

Let $j^* = \left\lceil n \left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right\rceil$. Then, by Lemma 2,

$$F\left(\hat{F}^{-1}\left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right)\right)\right) \succeq \mathrm{Beta}(j^*, n - j^* + 1),$$

so

$$\mathbb{E}\left[F\left(\hat{F}^{-1}\left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right)\right)\right)\right] \ge \frac{j^*}{ n+1} = \frac{\left\lceil n \left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right\rceil}{n+1}.$$

By the definition of $\tilde q$, we see that

$$\frac{\left\lceil n \left(\tilde q - \frac{2}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right\rceil}{n+1} \ge \frac{1-\alpha}{1-\gamma\alpha},$$

holds, which implies Equation 5 and thus completes the proof.

B.2. Proof of Theorem 3

We adopt the definitions of $F$ and $\hat F$ from the proof of Theorem 2, and define $E$ as the event

$$\left\{ \frac{1}{n}\#\{i:[s_i]< \hat s\} \leq \tilde q + \frac{2\max\left\{\frac{\tilde q}{1-\tilde q},1\right\}\log(m/(\gamma\alpha))}{n\epsilon} \right\},$$

which by Lemma 1 has probability at least $1-\gamma\alpha$. By similar reasoning as in the proof of Theorem 2, we obtain the following:

$$\begin{aligned} \mathbb{E}\left[F(\hat s)\right] &= \mathbb{E}\left[F(\hat s)\mathbf{1}\{E^c\}\right] + \mathbb{E}\left[F(\hat s)\mathbf{1}\{E\}\right] \\ &\leq \gamma\alpha\cdot 1 + \mathbb{E}\left[F(\hat s)\mathbf{1}\{E\}\right] \\ &\leq \gamma\alpha + (1-\gamma\alpha) \left(\mathbb{E}\left[F\left(\hat{F}^{-1}\left(\tilde q + \frac{2\max\{\frac{\tilde q}{1-\tilde q},1\}}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right) \right] + p^m_{\max}\right),\end{aligned} \qquad (6)$$

where the final inequality follows by the definition of EE.

Let $j^* = \left\lceil n \left(\tilde q + \frac{2\max\{\frac{\tilde q}{1-\tilde q},1\}}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right\rceil$. By Lemma 2, we have

$$F\left(\hat{F}^{-1}\left(\tilde q + \frac{2\max\{\frac{\tilde q}{1-\tilde q},1\}}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right) \preceq \mathrm{Beta}(j^*, n - j^* + 1) + p_{\max}^m,$$

so

$$\mathbb{E}\left[F\left(\hat{F}^{-1}\left(\tilde q + \frac{2\max\{\frac{\tilde q}{1-\tilde q},1\}}{\epsilon n}\log\left(m/(\gamma\alpha)\right)\right)\right) \right] \le \frac{j^*}{ n+1} + p_{\max}^m = \frac{\left\lceil n \left(\tilde q + \frac{2\max\{\frac{\tilde q}{1-\tilde q},1\}}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right\rceil}{n+1} + p_{\max}^m. \qquad (7)$$

By the definition of $\tilde q$, we see that

$$\begin{aligned} \frac{\left\lceil n \left(\tilde q + \frac{2\max\{\frac{\tilde q}{1-\tilde q},1\}}{\epsilon n}\log\left(m/(\gamma\alpha)\right) \right)\right\rceil}{n+1} &\le \frac{\frac{1-\alpha}{1-\gamma\alpha}(n+1) + \frac{2(1 +\max\{\frac{\tilde q}{1-\tilde q},1\})}{\epsilon}\log\left(m/(\gamma\alpha)\right) }{n+1} \\ &= \frac{1-\alpha}{1-\gamma\alpha} + \frac{2(1 +\max\{\frac{\tilde q}{1-\tilde q},1\})\log(m/(\gamma\alpha))}{(n+1) \epsilon }.\end{aligned} \qquad (8)$$

Putting together Equations (6), (7), and (8) completes the proof.

Appendix C: Experimental details

Choosing $m^*$ and $\gamma$. Algorithm 5 gives automatic choices of the optimal number of uniformly spaced bins, $m^*$, and the tuning parameter $\gamma$ that work well for approximately uniformly distributed scores. In a moment, we will show how to find the optimal value $\gamma^*$ for a fixed value of $m$. Once $\gamma^*$ is chosen, we simulate uniformly distributed scores to choose the value $m^*$ that results in the best quantile for specific, predetermined values of $\alpha$, $\epsilon$, and $n$. In practice, $m^*$ can be chosen from a relatively coarse grid of, say, 50 values logarithmically spaced from $10^2$ to $10^6$.

We choose the optimal value $\gamma^*$ by solving for the zeros of the derivative $\frac{\delta\tilde{q}}{\delta\gamma}$, which leads to the quadratic equation

$$\frac{\delta\tilde{q}}{\delta \gamma} = 0 \;\Longleftrightarrow\; \alpha^2\gamma^2 -\left( \frac{\alpha(1-\alpha)\epsilon(n+1)}{2} +2\alpha \right)\gamma + 1 = 0. \qquad (9)$$

Letting $\Gamma$ be the set of roots of (9), we can then choose the optimal value $\gamma^*$ as

$$\gamma^* = \mathop{\mathrm{arg\,min}}_{\gamma \in (\Gamma \cap (0,1)) \cup \{10^{-12}\}} \left[ \frac{(n+1)(1-\alpha)}{n(1-\gamma\alpha)} + \frac{2}{\epsilon n}\log\left(\frac{m}{\gamma\alpha}\right)\right], \qquad (10)$$

where the value $10^{-12}$ takes care of the case in which both roots lie outside the interval $(0,1)$.
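A small sketch of this rule follows; the helper name `gamma_star` is our own, and the function simply solves the quadratic in (9) and evaluates the objective in (10) on the admissible candidates.

```python
import numpy as np

def gamma_star(n, epsilon, alpha, m):
    """Pick gamma* per Equations (9)-(10): roots of the quadratic restricted to (0, 1),
    with 1e-12 as an additional fallback candidate, minimizing q_tilde from Equation (3)."""
    a = alpha ** 2
    b = -(alpha * (1 - alpha) * epsilon * (n + 1) / 2 + 2 * alpha)
    c = 1.0
    disc = b ** 2 - 4 * a * c
    roots = [(-b - np.sqrt(disc)) / (2 * a), (-b + np.sqrt(disc)) / (2 * a)] if disc >= 0 else []
    candidates = [g for g in roots if 0 < g < 1] + [1e-12]

    def q_tilde(g):
        return (n + 1) * (1 - alpha) / (n * (1 - g * alpha)) \
               + 2 / (epsilon * n) * np.log(m / (g * alpha))

    return min(candidates, key=q_tilde)
```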

Algorithm 5. Get optimal number of bins and $\gamma$

input: number of calibration points $n$, privacy level $\epsilon >0$, confidence level $\alpha\in (0,1)$
Simulate $n$ uniform conformity scores $s_i \sim \mathrm{Unif}(0,1)$, $i=1,...,n$
Choose $m^*$ to be the value of $m$ minimizing the output of Algorithm 3 on the $s_i$, with the optimal $\gamma^*$ chosen by (10).

output: $m^*$, $\gamma^*$
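A possible implementation of Algorithm 5 is sketched below, reusing the `gamma_star` helper from the previous sketch. To keep the search deterministic, it scores each candidate m with a noise-free proxy for Algorithm 3's output (the discretized empirical quantile of the simulated scores at level min(q̃, 1)) rather than re-running the randomized mechanism; this proxy is an implementation choice on our part, not a detail specified in the paper.

```python
import numpy as np

def choose_m_star(n, epsilon, alpha, grid=None, rng=None):
    """Sketch of Algorithm 5: pick m* (and its gamma*) from a coarse grid
    using n simulated Unif(0, 1) conformity scores."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.unique(np.logspace(2, 6, 50).astype(int)) if grid is None else grid
    sim_scores = rng.uniform(size=n)                     # simulated conformity scores

    def proxy_cutoff(m):
        g = gamma_star(n, epsilon, alpha, m)             # helper from the previous sketch
        q_tilde = (n + 1) * (1 - alpha) / (n * (1 - g * alpha)) \
                  + 2 / (epsilon * n) * np.log(m / (g * alpha))
        edges = np.linspace(0.0, 1.0, m + 1)[1:]         # upper edges of m uniform bins
        disc = edges[np.minimum(np.searchsorted(edges, sim_scores), m - 1)]
        return np.quantile(disc, min(q_tilde, 1.0), method="higher")

    m_star = min(grid, key=proxy_cutoff)
    return int(m_star), gamma_star(n, epsilon, alpha, m_star)
```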

Private training procedure. We used the Opacus library with the default parameter choices included in the CIFAR-10 example code. The only difference in the nonprivate model training is the use of the --disable-dp flag, turning off the added noise but preserving all other settings.


©2022 Anastasios Nikolas Angelopoulos, Stephen Bates, Tijana Zrnic, and Michael I. Jordan. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
