Skip to main content
SearchLoginLogin or Signup

With Malice Toward None: Assessing Uncertainty via Equalized Coverage

Published onApr 30, 2020
With Malice Toward None: Assessing Uncertainty via Equalized Coverage


An important factor to guarantee a fair use of data-driven recommendation systems is that we should be able to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology, equalized coverage, is flexible as it can be viewed as a wrapper around any predictive algorithm. We test the applicability of the proposed framework on real data, demonstrating that equalized coverage constructs unbiased prediction intervals, unlike competitive methods.

Keywords: conformal prediction, calibration, uncertainty quantification, fairness, quantile regression, unbiased predictions.

Media Summary

Machine learning algorithms are increasingly deployed in sensitive applications to inform the selection of job candidates, to inform bail and parole decisions, and to filter loan application, among many others. Such practices have become the subject of intense scrutiny, as society must be concerned about whether these algorithms reinforce discrimination and make the status quo normative. A major area of study has therefore been to propose mathematical definitions of appropriate notions of fairness or algorithmic models of fairness, and to make sure that learned models comply with such prescriptions. Because fairness is rather ill defined, it has been reported that such notions can be incompatible with/or hurt the groups they intend to protect.

In this work, we follow the prescription of Corbett-Davies and Goel (2018) and decouple the statistical problem of risk assessment from the policy problem of taking actions or designing interventions. Rather than dictating policy, our aim will solely be the design of statistical algorithms providing the decision maker with information summarizing the knowledge that can be extracted from state-of-the-art machine learning systems in a way that is mathematically guaranteed to be unbiased regardless of a person’s protected attributes, providing an operational definition of fairness. This is accomplished by constructing prediction sets, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology is flexible in the sense that it can be wrapped around any predictive algorithm. For instance, in a stylized application where a recommendation system for college admission predicts the GPA of candidate students after 2 years of undergraduate education, our methodology would modify this system to produce, for each student, a range of values obeying two properties. First, the range contains the true outcome 90% of the time (or any other percentage). Second, this property holds regardless of the group to which the student belongs.

1. Introduction

1.1. The Problem of Equitable Treatment

We are increasingly turning to machine learning systems to support human decisions. While decision makers may be subject to many forms of prejudice and bias, the promise and hope is that machines would be able to make more equitable decisions. Unfortunately, whether because they are fitted on already biased data or, otherwise, there are concerns that some of these data-driven recommendation systems treat members of different classes differently, perpetrating biases, providing different degrees of utilities, and inducing disparities. The examples that have emerged are quite varied:

  1. Criminal justice: Courts in the United States may use COMPAS—a commercially available algorithm to assess a criminal defendant’s likelihood of becoming a recidivist—to help them decide who should receive parole, based on records collected through the criminal justice system. In 2016 ProPublica analyzed COMPAS and found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk (Dieterich et al., 2016).1, 2

  2. Recognition system: A department of motor vehicles (DMV) may use facial recognition tools to detect people with false identities, by comparing driver’s license or ID photos with other DMV images on file. In a related context, Buolamwini and Gebru (2018) evaluated the performance of three commercial classification systems that employ facial images to predict individuals’ gender, and reported that the overall classification accuracy on male individuals was higher than female individuals. They also found that the predictive performance on lighter skinned individuals was higher than darker skinned individuals.

  3. College admissions: A college admission office may be interested in a new algorithm for predicting the college GPA of a candidate student at the end of their sophomore year, by using features such as high school GPA, SAT scores, AP courses taken and scores, intended major, levels of physical activity, and so on. On a similar matter, the work reported in Gardner et al. (2019) studied various data-driven algorithms that aim to predict whether a student will drop out from a massive open online course (MOOC). Using a large data set available from Gardner et al. (2018), the authors found that in some cases there are noticeable differences between the models’ predictive performance on male students compared to female students.

  4. Disease risk: Health care providers may be interested in predicting the chance that an individual develops certain disorders. Diseases with a genetic component have different frequencies in different human populations, reflecting the fact that disease-causing mutations arose at different times and in individuals residing in different areas: for example, Tay-Sachs disease is approximately 100 times more common in infants of Ashkenazi Jewish ancestry (Central-Eastern Europe) than in non-Jewish infants (Kaback et al., 1977). The genotyping of DNA polymorphisms can lead to more precise individual risk assessment than that derived from simply knowing to which ethnic group the individual belongs. However, given our still partial knowledge of the disease-causing mutations and their prevalence in different populations, the precision of these estimates varies substantially across ethnic groups. For instance, the study reported in Kessler et al. (2016) found a preference for European genetic variants over non-European variants in two genomic databases that are widely used by clinical geneticists (this reflects the fact that most studies have been conducted on European populations). Relying only on this information would result in predictions that are more accurate for individuals of European descent than for others.

The breadth of these examples underscores how data must be interpreted with care; the method that is advocated in this article is useful regardless of whether the disparity is due to factors of inequality/bias, or instead due to genetic risk. Indeed, policymakers have issued a call that Executive Office of the President (2014):

we must uphold our fundamental values so these systems are neither destructive nor opportunity limiting. [...] In order to ensure that growth in the use of data analytics is matched with equal innovation to protect the rights of Americans, it will be important to support research into mitigating algorithmic discrimination, building systems that support fairness and accountability, and developing strong data ethics frameworks.

This is a broad call that covers multiple aspects of data collection, mining, and interpretation; clearly, a response requires a multifaceted approach. Encouragingly, the machine learning community is beginning to respond to this challenge. A major area of study has been to propose mathematical definitions of appropriate notions of fairness (Dwork et al., 2012; Hardt et al., 2016; Dieterich et al., 2016; Zafar et al., 2017; Hebert-Johnson et al., 2018; Kim et al., 2019) or algorithmic models of fairness (Kusner et al., 2017). In many cases, these definitions are an attempt to articulate in mathematical terms what it means not to discriminate on the basis of protected characteristics; U.S. law identifies these as sex, race, age, disability, color, creed, national origin, religion, and genetic information. Now, discrimination can take many forms, and it is not surprising that it might be difficult to identify one analytical property that detects it in every context. Moreover, the call above is broader than the specific domains where discrimination is forbidden by law and invites us to develop analytical frameworks that guarantee an ethical use of data.

1.2. Responses From the Machine Learning Community

To begin to formalize the problem, it is useful to consider the task of predicting the value of YY, a binary {0,1}\{0,1\} variable, with a guess Y^\hat{Y}. We assume that Y=1Y=1 represents a more favorable state, and that the value of Y^\hat{Y} will influence the decider, so that predicting Y^=1\hat{Y}=1 for some individuals gives them an advantage. In this context, P(Y^=0Y=1){\mathbb P}(\hat{Y}=0 \mid Y=1), the false negative rate, represents the probability with which an opportunity is denied to a well deserving individual. It is obvious that this is a critical error rate to control in scenarios such as deciding parole (see Example 1): freedom is a fundamental right, and nobody should be deprived of it needlessly. We then wish to require that P(Y^=0Y=1,A=a){\mathbb P}(\hat{Y}=0 \mid Y=1, A=a) is equal across values of the protected attribute AA (Hardt et al., 2016). In the case of distributions of goods (as when giving a loan), one might argue for parity of other measures such as P(Y^=1A=a){\mathbb P}(\hat{Y}=1\mid A=a), which would guarantee that resources are distributed equally across the different population categories (Dwork et al., 2012; Feldman et al., 2015; Zafar et al., 2019). Indeed, these observations are at the basis of two notions of fairness considered in the literature.

Researchers have noted several problems with fairness measures that ask for (approximate) parity of some statistical measure across all of these groups. Without providing a complete discussion, we list some of these problems here. (a) To begin with, it is usually unclear how to design algorithms that would actually obey these notions of fairness from finite samples, especially in situations where the outcome of interest or protected attribute is continuous. (b) Even if we could somehow operationalize the fairness program, these measures are usually incompatible: it is provably impossible to design an algorithm that obeys all notions of fairness simultaneously (Chouldechova, 2017; Kleinberg et al., 2017). (c) The appropriate measure of fairness appears to be context dependent. Consider Example 4 and suppose that Y=1Y=1 corresponds to having Tay-Sachs, whose rate differs across populations. Due to the unbalanced nature of the disease, one would expect the predictive model to have a lower true positive rate for non-Jewish infants than that of Ashkenazi Jewish infants (for which the disease is much more common). Here, forcing parity of true positive rates (Hardt et al., 2016) would conflict with accurate predictions for each group (Hebert-Johnson et al., 2018). (d) Finally, and perhaps more importantly, researchers have argued that enforcing frequently discussed fairness criteria can often harm the very groups that these measures were designed to protect (Corbett-Davies and Goel, 2018).

In light of this, it has been suggested to decouple the statistical problem of risk assessment from the policy problem of taking actions and designing interventions. Quoting from Corbett-Davies and Goel (2018), “an algorithm might (correctly) infer that a defendant has a 20% chance of committing a violent crime if released, but that fact does not, in and of itself, determine a course of action.” Keeping away from policy then, how can we respond to the call in Executive Office of the President (2014) and provide a policymaker the best information gleaned from data while supporting equitable treatment? Our belief is that multiple approaches will be needed, and with this short article our aim is to introduce an additional tool to evaluate the performance of algorithms across different population groups.

1.3. This Article: Equalized Coverage

One fundamental way to support data ethics is not to overstate the power of algorithms and data-based predictions, but rather always accompany these with measures of uncertainty that are easily understandable by the user. This can be done, for example, by providing a plausible range of predicted values for the outcome of interest. For instance, consider a recommendation system for college admission (Example 3), not knowing about the accuracy of the prediction algorithm, we would like to produce for, each student, a predicted GPA interval [Y^lo, Y^hi][\hat{Y}_{\text{lo}}, \ \hat{Y}_{\text{hi}}] obeying the following two properties: First, the interval should be faithful in the sense that the true unknown outcome YY lies within the predicted range 90% of the time, say; second, this should be unbiased in that the average coverage should be the same within each group.

Such a predictive interval has the virtue of informing the decision maker about the evidence machine learning can provide while being explicit about the limits of predictive performance. If the interval is long, it just means that the predictive model can say little. Each group enjoys identical treatment, receiving equal coverage (e.g., 90%, or any level the decision maker wishes to achieve). Hence, the results of data analysis are unbiased to all. In particular, if the larger sample size available for one group overly influences the fit, leading to poor performance in the other groups, the prediction interval will make this immediately apparent through much wider confidence bands for the groups with fewer samples. Prediction intervals with equalized coverage, then, naturally assess and communicate the fact that an algorithm has varied levels of performance on different subgroups.

It seems impossible a priori to present information to the policymaker in such a compelling fashion without a strong model for dependence of the response YY on the features XX or protected attributes AA. In our college admission example, one may have trained a wide array of complicated predictive algorithms such as random forests or deep neural networks, each with its own suite of parameters; for all practical purposes, the fitting procedure may just as well be a black box. The surprise is that such a feat is possible under no assumption other than that of having samples that are drawn exchangeably—for example, they may be drawn i.i.d.—from a population of interest. We propose a concrete procedure, which acts as a wrapper around the predictive model, to produce valid prediction intervals that provably satisfy the equalized coverage constraint for any black box algorithm, sample size, and distribution. Such a procedure can be formulated by refining tools from conformal inference, a general methodology for constructing prediction intervals (Vovk et al., 1999; Papadopoulos et al., 2002; Vovk et al., 2005, 2009; Lei et al., 2013, 2018; Romano et al., 2019). Our contribution extends classical conformal inference as we seek a form of conditional rather than marginal coverage guarantee (Vovk, 2012; Lei and Wasserman, 2014; Barber et al., 2019).

The specific procedure we suggest to construct predictive intervals with equal coverage, then, supports equitable treatment in an additional dimension. Specifically, we use the same learning algorithm for all individuals, borrowing strength from the entire population, and leveraging the entire data set, while adjusting global predictions to make local confidence statements valid for each group. Such a training strategy may also improve the statistical efficiency of the predictive model, as illustrated by our experiments in Section 3. Of course, our approach comes with limitations as well: we discuss these and possible extensions in Section 4.

2. Equalized Coverage

Let {(Xi,Ai,Yi)}\{(X_i, A_i, Y_i)\}, i=1,,ni = 1, \ldots, n, be some training data where the vector XiRpX_i \in {\mathbb R}^p may contain the sensitive attribute Ai{0,1,2,}A_i \in \{0,1,2,\dots\} as one of the features. Consider a test point with known Xn+1X_{n+1} and An+1A_{n+1}, we aim to construct a prediction interval C(Xn+1,An+1)RC(X_{n+1},A_{n+1}) \subseteq {\mathbb R} which contains the unknown response Yn+1RY_{n+1} \in {\mathbb R}, with probability at least 1α1-\alpha on average within each group; here, 0<1α<10 < 1-\alpha < 1 is a desired coverage level. Our ideas extend to categorical responses in a fairly straightforward fashion; for brevity, we do not consider these extensions in this article. Interested readers will find details on how to build valid conformal prediction sets in classification problems in Shafer and Vovk (2008). Formally, we assume that the training and test samples {(Xi,Ai,Yi)}i=1n+1\{(X_i,A_i,Y_i)\}_{i=1}^{n+1} are drawn exchangeably from some arbitrary and unknown distribution PXAYP_{XAY}, and we wish that our prediction interval obeys the following property:

(1)P{Yn+1C(Xn+1,An+1)An+1=a}1α\begin{aligned} (1) \qquad {\mathbb P}\{Y_{n+1} \in C(X_{n+1},A_{n+1}) \mid A_{n+1}=a\} \geq 1-\alpha\end{aligned}

for all aa, where the probability is taken over the nn training samples and the test case. Once more, (1) must hold for any distribution PXAYP_{XAY}, sample size nn, and regardless of the group identifier An+1A_{n+1}. (While this only ensures that coverage is at least 1α1-\alpha for each group—and, therefore, the groups may have unequal coverage level—we will see that under mild conditions the coverage can also be upper bounded to lie very close to the target level 1α1-\alpha.)

In this section we present a methodology to achieve (1). Our solution builds on classical conformal prediction (Vovk et al., 2005; Lei et al., 2018) and the recent conformalized quantile regression (CQR) approach (Romano et al., 2019) originally designed to construct marginal distribution-free prediction intervals (see also Kivaranovic et al., 2019). CQR combines the rigorous coverage guarantee of conformal prediction with the statistical efficiency of quantile regression (Koenker and Bassett, 1978) and has been shown to be adaptive to the local variability of the data distribution under study. Below, we present a modification of CQR obeying (1). Then in Section 2.2, we draw connections to conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) and explain how classical conformal inference can also be used to construct prediction intervals with equal coverage across protected groups.3

Before describing the proposed method we introduce a key result in conformal prediction, adapted to our conditional setting. Variants of the following lemma appear in the literature (Vovk, 2012; Vovk et al., 2005; Lei et al., 2018; Tibshirani et al., 2019; Romano et al., 2019).

Lemma 1. Suppose the random variables Z1,,Zm+1Z_1,\dots,Z_{m+1} are exchangeable conditional on Am+1=aA_{m+1}=a, and define Q1αQ_{1-\alpha} to be the

(1α)(1+1/m)-th empirical quantile of {Zi:1im}(1-\alpha)(1+1/m)\text{-th empirical quantile of } \left\{Z_i : 1\leq i \leq m \right\}

For any α(0,1)\alpha \in (0,1),

P{Zm+1Q1αAm+1=a}α.{\mathbb P}\{Z_{m+1} \leq Q_{1-\alpha} \mid A_{m+1}=a\} \geq \alpha.

Moreover, if the random variables Z1,,Zm+1Z_1,\dots,Z_{m+1} are almost surely distinct, then it also holds that

P{Zm+1Q1αAm+1=a}α+1/(m+1).{\mathbb P}\{Z_{m+1} \leq Q_{1-\alpha} \mid A_{m+1}=a\} \leq \alpha + 1/(m+1).

2.1. Group-Conditional Conformalized Quantile Regression (CQR)

Our method starts by randomly splitting the nn training points into two disjoint subsets; a proper training set {(Xi,Ai,Yi):iI1}\left\lbrace (X_i,A_i,Y_i): i \in \mathcal{I}_1 \right\rbrace and a calibration set {(Xi,Ai,Yi):iI2}\left\lbrace (X_i,A_i,Y_i): i \in \mathcal{I}_2 \right\rbrace. Then, consider any algorithm A\mathcal{A} for quantile regression that estimates conditional quantile functions from observational data, such as quantile neural networks (Taylor, 2000) (described in Appendix). To construct a prediction interval with 1α1-\alpha coverage, fit two conditional quantile functions on the proper training set,

(2){q^αlo,q^αhi}A({(Xi,Yi):iI1}),\begin{aligned} (2) \qquad \left\lbrace \hat{q}_{\alpha_{\text{\rm lo}}}, \hat{q}_{\alpha_{\text{\rm hi}}} \right\rbrace \leftarrow \mathcal{A}(\left\lbrace (X_i, Y_i): i \in \mathcal{I}_1 \right\rbrace), \end{aligned}

at levels αlo=α/2\alpha_{\text{\rm lo}}= \alpha/2 and αhi=1α/2\alpha_{\text{\rm hi}}= 1 - \alpha/2, say, and form a first estimate of the prediction interval C^init(x)=[q^αlo(x), q^αhi(x)]\hat{C}_{\text{init}}(x) = [\hat{q}_{\alpha_{\text{\rm lo}}}(x), \ \hat{q}_{\alpha_{\text{\rm hi}}}(x)] at X=xX=x. C^init(x)\hat{C}_{\text{init}}(x) is constructed with the goal that a new case with covariates xx should have probability 1α1-\alpha of its response lying in the interval C^init(x)\hat{C}_{\text{init}}(x), but the interval C^init(x)\hat{C}_{\text{init}}(x) was empirically shown to under- or over- cover the test target variable (Romano et al., 2019). (Quantile regression algorithms are not supported by finite sample coverage guarantees [Steinwart and Christmann, 2011; Takeuchi et al., 2006; Meinshausen, 2006; Zhou and Portnoy, 1996, 1998].)

This motivates the next step that borrows ideas from split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) and CQR (Romano et al., 2019). Consider a group A=aA=a, and compute the empirical errors (often called conformity scores) achieved by the first guess C^init(x)\hat{C}_{\text{init}}(x). This is done by extracting the calibration points that belong to that group,

I2(a)={i:iI2 and Ai=a},\begin{aligned} \mathcal{I}_2(a) = \{ i : i\in \mathcal{I}_2 \ \text{and} \ A_i=a\},\end{aligned}

and evaluating

(3)Ei:=max{q^αlo(Xi)Yi,Yiq^αhi(Xi)},iI2(a).\begin{aligned} (3) \qquad E_i := \max\{\hat{q}_{\alpha_{\text{\rm lo}}}(X_i) - Y_i, Y_i - \hat{q}_{\alpha_{\text{\rm hi}}}(X_i)\}, \quad i \in \mathcal{I}_2(a). \end{aligned}

This step provides a family of conformity scores {Ei:iI2(a)}\left\lbrace E_i: i \in \mathcal{I}_2(a) \right\rbrace that are restricted to the group A=aA=a. Each score measures the signed distance of the target variable YiY_i to the boundary of the interval C^init(x)\hat{C}_{\text{init}}(x); if YiY_i is located outside the initial interval, then Ei>0E_i>0 is equal to the distance to the closest interval endpoint. If YiY_i lies inside the interval, then Ei0E_i\leq0 and its magnitude also equals the distance to the closest endpoint. As we shall see immediately below, these scores may serve to measure the quality of the initial guess C^init()\hat{C}_{\text{init}}(\cdot) and used to calibrate it as to obtain the desired distribution-free coverage. Crucially, our approach makes no assumptions on the form or the properties of C^init()\hat{C}_{\text{init}}(\cdot)—it may come from any model class, and is not required to meet any particular level of accuracy or coverage. Its role is to provide a base algorithm that effectively estimates the underlying uncertainty, around which we will build our predictive intervals.

Finally, the following crucial step builds a prediction interval for the unknown Yn+1Y_{n+1} given Xn+1=xX_{n+1}=x and An+1=aA_{n+1}=a. This is done by computing

Q1α(E,I2(a)):=(1α)(1+1/I2(a))-th empirical quantile of{Ei:iI2(a)},\begin{aligned} Q_{1-\alpha}(E, \mathcal{I}_2(a)) := (1-\alpha)(1+1/|\mathcal{I}_2(a)|)\text{-th empirical quantile of} \left\{E_i : i \in \mathcal{I}_2(a) \right\},\end{aligned}

which is then used to calibrate the first interval estimate as follows:

(4)C^(x,a):=[q^αlo(x)Q1α(E,I2(a)), q^αhi(x)+Q1α(E,I2(a))].\begin{aligned} (4) \qquad \hat C(x, a) := \left[ \hat{q}_{\alpha_{\text{\rm lo}}}(x) - Q_{1-\alpha}(E, \mathcal{I}_2(a)) , \ \hat{q}_{\alpha_{\text{\rm hi}}}(x) + Q_{1-\alpha}(E, \mathcal{I}_2(a)) \right]. \end{aligned}

Before proving the validity of the interval in (4), we pause to present two possible training strategies for the initial quantile regression interval C^init(x).\hat{C}_{\text{init}}(x). We refer to the first as joint training as it uses the whole proper training set to learn a predictive model, see (2). The second approach, which we call groupwise training, constructs a prediction interval for Yn+1Y_{n+1} separately for each group; that is, for each value a=0,1,2,a=0,1,2,\dots, we fit a regression model to all training examples with An+1=aA_{n+1}=a. These two variants of the CQR procedure are summarized in Algorithm 1. While the statistical efficiency of the two approaches can differ (as we show in Section 3), both are guaranteed to attain valid group-conditional coverage for any data distribution and regardless of the choice or accuracy of the quantile regression estimate.

Theorem 1. If (Xi,Ai,Yi)(X_i, A_i, Y_i), i=1,,n+1i=1,\dots, n+1 are exchangeable, then the prediction interval C^(Xn+1,An+1)\hat{C}(X_{n+1},A_{n+1}) constructed by Algorithm 1 obeys

P{Yn+1C^(Xn+1,An+1)An+1=a}1α{\mathbb P}\{Y_{n+1} \in \hat C(X_{n+1}, A_{n+1}) \mid A_{n+1}=a \} \geq 1-\alpha

for each group a=0,1,2,a=0,1,2,\dots. Moreover, if the conformity scores {Ei: iI2(a){n+1}}\{E_i : \ i\in\mathcal{I}_2(a) \cup \{n+1\}\} for An+1=aA_{n+1}=a are almost surely distinct, then the group-conditional prediction interval is nearly perfectly calibrated:

P{Yn+1C^(Xn+1,An+1)An+1=a}1α+1I2(a)+1{\mathbb P}\{Y_{n+1} \in \hat C(X_{n+1}, A_{n+1}) \mid A_{n+1}=a\} \leq 1-\alpha+\frac{1}{|\mathcal{I}_2(a)|+1}

for each group a=0,1,2,a=0,1,2,\dots.

Proof. Fix any group aa. Since our calibration samples are exchangeable, the conformity scores (3) {Ei: iI2(a)}\{E_{i} : \ i \in \mathcal{I}_2(a)\} are also exchangeable. Exchangeability also holds when we add the test score En+1E_{n+1} to this list. Consequently, by Lemma 1,

(5)1αP(En+1Q1α(E,I2(a))An+1=a)1α+1I2(a)+1,(5) \quad 1-\alpha \leq {\mathbb P}(E_{n+1} \leq Q_{1-\alpha}(E, \mathcal{I}_2(a)) \mid A_{n+1}=a) \leq 1-\alpha+\frac{1}{|\mathcal{I}_2(a)|+1},

where the upper bound holds under the additional assumption that {Ei: iI2(a){n+1}}\{E_{i} : \ i \in \mathcal{I}_2(a) \cup \{n+1\}\} are almost surely distinct, while the lower bound holds without this assumption.

To prove the validity of C^(Xn+1,An+1)\hat C(X_{n+1},A_{n+1}) conditional on An+1=aA_{n+1}=a, observe that, by definition,

Yn+1C^(Xn+1,An+1)if and only ifEn+1Q1α(E,I2(a)).Y_{n+1} \in \hat C(X_{n+1},A_{n+1}) \quad \text{if and only if} \quad E_{n+1} \le Q_{1-\alpha}(E, \mathcal{I}_2(a)).

Hence, the result follows from (5).

Variant: Asymmetric group-conditional CQR. When the distribution of the conformity scores is highly skewed, the coverage error may spread asymmetrically over the left and right tails. In some applications it may be better to consider a variant of that controls the coverage of the two tails separately, leading to a stronger conditional coverage guarantee. To achieve this goal, we follow the approach from Romano et al. (2019) and evaluate two separate empirical quantile functions: one for the left tail,

Q1αlo(Elo,I2(a)):=(1αlo)(1+1/I2(a))-th empirical quantile of{q^αlo(Xi)Yi:iI2(a)};\begin{aligned} Q_{1-\alpha_{\text{\rm lo}}}(E_{\mathrm{lo}}, & \mathcal{I}_2(a)) := \\ & (1-\alpha_{\text{\rm lo}})(1+1/|\mathcal{I}_2(a)|)\text{-th empirical quantile of} \left\{\hat{q}_{\alpha_{\text{\rm lo}}}(X_i) - Y_i : i \in \mathcal{I}_2(a) \right\};\end{aligned}

and the second for the right tail

Q1αhi(Ehi,I2(a)):=(1αhi)(1+1/I2(a))-th empirical quantile of{Yiq^αhi(Xi):iI2(a)}.\begin{aligned} Q_{1-\alpha_{\text{\rm hi}}}(E_{\mathrm{hi}}, & \mathcal{I}_2(a)) := \\ & (1-\alpha_{\text{\rm hi}})(1+1/|\mathcal{I}_2(a)|)\text{-th empirical quantile of} \left\{Y_i - \hat{q}_{\alpha_{\text{\rm hi}}}(X_i): i \in \mathcal{I}_2(a) \right\}.\end{aligned}

Next, we set α=αlo+αhi\alpha = \alpha_{\text{\rm lo}}+ \alpha_{\text{\rm hi}} and construct the interval for Yn+1Y_{n+1} given Xn+1=xX_{n+1}=x and An+1=aA_{n+1}=a:

(6) C^(x,a):=[q^αlo(x)Q1αlo(Elo,I2(a)),  q^αhi(x)+Q1αhi(Ehi,I2(a))](6) \ \hat C(x,a) := [\hat{q}_{\alpha_{\text{\rm lo}}}(x) - Q_{1-\alpha_{\text{\rm lo}}}(E_{\mathrm{lo}}, \mathcal{I}_2(a)),\; \hat{q}_{\alpha_{\text{\rm hi}}}(x) + Q_{1-\alpha_{\text{\rm hi}}}(E_{\mathrm{hi}}, \mathcal{I}_2(a))]

The validity of this procedure is stated below.

Theorem 2. Suppose the samples (Xi,Ai,Yi)(X_i, A_i, Y_i), i=1,,n+1i=1,\dots, n+1 are exchangeable. With the notation above, put lower(Xn+1)=q^αlo(Xn+1)Q1αlo(Elo,I2(An+1))\operatorname{lower}(X_{n+1}) = \hat{q}_{\alpha_{\text{\rm lo}}}(X_{n+1}) - Q_{1-\alpha_{\text{\rm lo}}}(E_{\mathrm{lo}},\mathcal{I}_2(A_{n+1})) and upper(Xn+1)=q^αhi(Xn+1)+Q1αhi(Ehi,I2(An+1))\operatorname{upper}(X_{n+1}) = \hat{q}_{\alpha_{\text{\rm hi}}}(X_{n+1}) + Q_{1-\alpha_{\text{\rm hi}}}(E_{\mathrm{hi}}, \mathcal{I}_2(A_{n+1})) for short. Then

1αloP{Yn+1lower(Xn+1)An+1=a}1αlo+1I2(a)+1%\label{eq:asymmetric-coverage-lower} 1-\alpha_{\text{\rm lo}}\leq {\mathbb P}\{Y_{n+1} \geq \operatorname{lower}(X_{n+1}) \mid A_{n+1}=a\} \leq{1-\alpha_{\text{\rm lo}}+\frac{1}{|\mathcal{I}_2(a)|+1}}


1αhiP{Yn+1upper(Xn+1)An+1=a}1αhi+1I2(a)+1,% \label{eq:asymmetric-coverage-upper} 1-\alpha_{\text{\rm hi}}\leq {\mathbb P}\{Y_{n+1} \leq \operatorname{upper}(X_{n+1}) \mid A_{n+1}=a\}\\ \leq{1-\alpha_{\text{\rm hi}}+\frac{1}{|\mathcal{I}_2(a)|+1}},

where the lower bounds above always hold while the upper bounds hold under the additional assumption that the residuals are almost surely distinct. Under these circumstances, the interval (5) obeys

1(αlo+αhi)P{Yn+1C^(Xn+1,An+1)An+1=a}1(αlo+αhi)+2I2(a)+1. 1 - (\alpha_{\text{\rm lo}}+ \alpha_{\text{\rm hi}}) \leq {\mathbb P}\{Y_{n+1} \in \hat C(X_{n+1}, A_{n+1}) \mid A_{n+1}=a \} \leq{1-(\alpha_{\text{\rm lo}}+ \alpha_{\text{\rm hi}})+\frac{2}{|\mathcal{I}_2(a)|+1}}.

Proof. As in the proof of Theorem 1, the validity of the lower and upper bounds is obtained by applying Lemma 1 twice.

2.2. Group-Conditional Conformal Prediction

The difference between CQR (Romano et al., 2019) and split conformal prediction (Papadopoulos et al., 2002) is that the former calibrates an estimated quantile regression interval C^init(X)\hat{C}_{\text{init}}(X), while the latter builds a prediction interval around an estimate of the conditional mean Y^=μ^(X)\hat Y = \hat \mu(X). For instance, μ^\hat \mu can be formulated as a classical regression function estimate, obtained by minimizing the mean-squared-error loss over the proper training examples. To construct predictive intervals for the group A=aA=a, then simply replace both q^αlo\hat{q}_{\alpha_{\text{\rm lo}}} and q^αhi\hat{q}_{\alpha_{\text{\rm hi}}} with μ^\hat{\mu} in (or in its two-tailed variant). The theorems go through, and this procedure gives predictive intervals with exactly the same guarantees as before. As we will see in our empirical results, a benefit of explicitly modeling quantiles is superior statistical efficiency.

3. Case Study: Predicting Utilization of Medical Services

The Medical Expenditure Panel Survey (MEPS) 2016 data set,4 provided by the Agency for Healthcare Research and Quality, contains information on individuals and their utilization of medical services. The features used for modeling include age, marital status, race, poverty status, functional limitations, health status, health insurance type, and more. We split these features into dummy variables to encode each category separately. The goal is to predict the health care system utilization of each individual; a score that reflects the number of visits to a doctor’s office, hospital visits, and so on. After removing observations with missing entries, there are n=15,656n=15,656 observations on p=139p=139 features. We set the sensitive attribute AA to race, with A=0A=0 for nonwhite and A=1A=1 for white individuals, resulting in n0=9,640n_0=9,640 samples for the first group and n1=6,016n_1=6,016 for the second. In all experiments we transform the response variable by Y=log(1+(utilization score))Y = \log(1 + (\text{utilization score})) as the raw score is highly skewed.

Figure 1. Empirical cumulative distribution function of the signed residuals R = Y −  for both values of the sensitive attribute, computed on test samples. The blue dashed horizontal line is the value of ℙ{Y ≤ |A = 0} equal to 0.51. Similarly, the red dashed horizontal line is ℙ{Y ≤ |A = 1} = 0.45. The dashed vertical colored lines present the 0.05th and 0.95th quantiles of each group, defined in (6) and (7) respectively. Left side: r0lo =  − 1.04 (blue), r1lo =  − 1.25 (red). Right side: r0hi = 1.59 (blue) and r1hi = 1.83 (red).

Below, we illustrate that empirical quantiles can be used to detect prediction bias. Next, we show that usual (marginal) conformal methods do not attain equal coverage across the two groups. Finally, we compare the performance of joint vs. groupwise model fitting and show that, in this example, the former yields shorter predictive intervals.

3.1. Bias Detection

We randomly split the data into training (80%) and test (20%) sets and standardize the features to have zero mean and unit variance; the means and variances are computed using the training examples. Then we fit a neural network regression function μ^\hat{\mu} on the training set, where the network architecture, optimization, and hyperparameters are similar to those described and implemented in Romano et al. (2019).5 Next, we compute the signed residuals of the test samples,

Ri=YiY^i,\begin{aligned} R_i = Y_i - \hat{Y}_i,\end{aligned}

where Y^i=μ^(Xi)\hat{Y}_i = \hat{\mu}(X_i), and plot the resulting empirical cumulative distribution functions P{RrA=0}{\mathbb P}\{R \leq r | A=0\} and P{RrA=1}{\mathbb P}\{R \leq r | A=1\} in Figure 1. Observe that P{RrA=0}P{RrA=1}{\mathbb P}\{R \leq r | A=0\} \neq {\mathbb P}\{R \leq r | A=1\}. In particular, when comparing the two functions at r=0r=0, we see that μ^\hat{\mu} overestimates the response of the non-White group and underestimates the response of the White group, as

P{YY^A=0}=0.51>0.45=P{YY^A=1}.\begin{aligned} {\mathbb P}\{Y \leq \hat{Y} | A=0\} = 0.51 > 0.45 = {\mathbb P}\{Y \leq \hat{Y} | A=1\}.\end{aligned}

Recall that the lower and upper quantiles of the signed residuals are used to construct valid group-conditional prediction intervals. While these must be evaluated on calibration examples (see next section), for illustrative purposes we present below the 0.05th and 0.95th quantiles of each group using the two cumulative distribution functions of test residuals. To this end, we denote by r0lor^{\text{lo}}_0 and r1lor^{\text{lo}}_1 the lower empirical quantiles of the non-White and White groups, defined to be the smallest numbers obeying the relationship

P{Rr0loA=0}0.05andP{Rr1loA=1}0.05.{\mathbb P}\{R \leq r^{\text{lo}}_0 | A = 0\} \geq 0.05 \quad \text{and} \quad {\mathbb P}\{R \leq r^{\text{lo}}_1 | A = 1\} \geq 0.05.

Following Figure 1, this pair is equal to

r0lo=1.04>1.25=r1lo,r^{\text{lo}}_0 = -1.04 > -1.25 = r^{\text{lo}}_1,

implying that for at least 5% of the test samples of each group, the fitted regression function μ^\hat{\mu} overestimates the utilization of medical services with larger errors for White individuals than for non-White individuals.

As for the upper empirical quantiles, we compute the smallest r0hir^{\text{hi}}_0 and r1hir^{\text{hi}}_1 obeying

P{Rr0hiA=0}0.95andP{Rr1hiA=1}0.95,{\mathbb P}\{R \leq r^{\text{hi}}_0 | A = 0\} \geq 0.95 \quad \text{and} \quad {\mathbb P}\{R \leq r^{\text{hi}}_1 | A = 1\} \geq 0.95,

and obtain

r0hi=1.59<1.83=r1hi.r^{\text{hi}}_0 = 1.59 < 1.83 = r^{\text{hi}}_1.

Here, in order to cover the target variable for White individuals at least 95% of the time, we should inflate the regression estimate by an additive factor equal to 1.831.83. For non-White individuals, the additive factor is smaller and equal to 1.591.59. This shows that μ^\hat{\mu} systematically predicts higher utilization of non-White individuals relative to White individuals.

3.2. Achieving Equalized Coverage

We now verify that our proposal constructs intervals with equal coverage across groups. Below, we set α=0.1\alpha = 0.1. To avoid the coverage errors to be spread arbitrarily over the left and right tails, we choose to control the two tails independently by setting αlo=αhi=α/2=0.05\alpha_{\text{\rm lo}}= \alpha_{\text{\rm hi}}= \alpha/2 = 0.05 in (5). We arbitrarily set the size of the proper training and calibration sets to be identical. (The features are standardized as discussed earlier.)

For our experiments, we test six different methods for producing conformal predictive intervals. We compare two types of constructions for the predictive interval:

  • Conformal prediction (CP), where the predictive interval is built around an estimated mean μ^\hat{\mu} (as described in Section 2.2);

  • Conformalized quantile regression (CQR), where the predictive interval is constructed around initial estimates q^αlo\hat{q}_{\alpha_{\text{lo}}} and q^αhi\hat{q}_{\alpha_{\text{hi}}} of the lower and upper quantiles.

In both cases, we use a neural network to construct the models; we train the models using the software provided by Romano et al. (2019), using the same neural network design and learning strategy. For both the CP and CQR constructions, we then implement three versions:

  • Marginal coverage, where the intervals C^(X)\hat{C}(X) are constructed by pooling all the data together rather than splitting into subgroups according to the value of AA;

  • Conditional coverage with groupwise models, where the initial model for the mean μ^\hat{\mu} or for the quantiles q^αlo,q^αhi\hat{q}_{\alpha_{\text{lo}}},\hat{q}_{\alpha_{\text{hi}}} is constructed separately for each group A=0A=0 and A=1A=1;

  • Conditional coverage with a joint model, where the initial model for the mean μ^\hat{\mu} or for the quantiles q^αlo,q^αhi\hat{q}_{\alpha_{\text{lo}}},\hat{q}_{\alpha_{\text{hi}}} is constructed pooling data across both groups A=0A=0 and A=1A=1.

The results are summarized in Table 1, displaying the average length and coverage of the marginal and group-conditional conformal methods. These are evaluated on unseen test data and averaged over 40 train-test splits, where 80% of the samples are used for training (the calibration examples are a subset of the training data) and 20% for testing. All the conditional methods perfectly achieve 90% coverage per group (this is a theorem after all). On the other hand, the marginal CP method under-covers in the White group and overcovers in the non-White group (interestingly, though, the marginal CQR method almost attains equalized coverage even though it is not designed to give such a guarantee).

Turning to the statistical efficiency of the conditional conformal methods, we see that conditional CQR outperforms conditional CP in that it constructs shorter and, hence, more informative intervals, especially for the non-White group. The table also shows that the intervals for the White group are wider than those for the non-White group across all four conditional methods, and that joint model fitting is here more effective than groupwise model fitting as the former achieves shorter prediction intervals.

4. Discussion

4.1. Larger Intervals for a Subpopulation

It is possible that the intervals constructed with our procedure have different lengths across groups. For example, our experiments show that, on average, the White group has wider intervals than the non-White group. Although one might argue that the different length distribution is in itself a type of unfairness, we want to caution the reader against assuming that a fair statistical procedure must necessarily produce intervals of equal length.

There are multiple aspects to consider. First, we believe that when there is a difference in performance across the protected groups, one needs to make this evident to the user and to understand the reasons behind it (we discuss below the issue with artificially forcing the two intervals to be of the same length). In some cases this difference might be reduced by improving the predictive algorithm, collecting more data for the population associated with poorer performance, introducing new features with higher predictive power, and so on. For example, in the context of studies that aim to predict disease risk on the basis of genetic features, it has become apparent that existing risk assessment tools suffer bias due to being constructed based on samples coming primarily from European populations; these tools will be much more effective if based on a larger sample that better reflects the diversity in the general population. It may also be the case that higher predictive precision in one group versus another may arise from bias, whether intentional or not, in the type of model we use, the choice of features we measure, or other aspects of our regression process—for example, if historically more emphasis was placed on finding accurate models for a particular group aa, then we may be measuring features that are highly informative for prediction within group aa while another group aa' would be better served by measuring a different set of variables. Crucially, we do not want to mask this differential in information, but rather make it explicit—thereby possibly motivating the decision makers to take action.

We also note that in some cases, reducing the difference in performance might not be possible while increasing information. For example, the collection of a large enough sample for a minority population might be impossible due to privacy considerations and financial burden. Or the outcome in question might have structurally different variability across the groups. In such cases, equal length prediction intervals might be constructed only artificially, reducing the precision of the statements one can make for a given group—a choice that should be made with the participation of users and policymakers, rather than by data analysts alone.

4.2. The Use of Protected Attribute

The debate around fairness in general, and our proposal in particular, requires the definition of classes of individuals across which we would like an unbiased treatment. In some cases these coincide with protected attributes where discrimination on their basis is prohibited by the law. The legislation sometimes does not allow the decision maker to know/use the protected attribute in reaching a conclusion, as a measure to caution against discrimination. While no discrimination is a goal everyone should embrace regardless whether the law mandates it or not, we shall consider the opportunity of using protected attributes in data-driven recommendation systems. On the one hand, ignoring protected attributes is certainly not sufficient to guarantee absence of discrimination (see, e.g., Dwork et al., 2012; Hardt et al., 2016; Dieterich et al., 2016; Corbett-Davies and Goel, 2018; Buolamwini and Gebru, 2018; Gardner et al., 2019; Zafar et al., 2017; Chouldechova, 2017). On the other hand, information on protected attributes might be necessary to guarantee equitable treatment. Our procedure relies on the knowledge of protected attributes, so we want to expand on this last point a little. In absence of knowledge of what are the causal determinants of an outcome, protected attributes can be an important component of a predictor. To quote from Corbett-Davies and Goel (2018): in the criminal justice system, for example, women are typically less likely to commit a future violent crime than men with similar criminal histories. As a result, gender-neutral risk scores can systematically overestimate a woman’s recidivism risk, and can in turn encourage unnecessarily harsh judicial decisions. Recognizing this problem, some jurisdictions, like Wisconsin, have turned to gender-specific risk assessment tools to ensure that estimates are not biased against women. For disease risk assessment (Example 4 earlier) or related tasks such as diagnosis and drug prescription, race often provides relevant information and is routinely used. Presumably, once we understand the molecular basis of diseases and drug responses, and once sufficiently accurate measurements on patients are available, race may cease to be useful. Given present circumstances, however, Risch et al. (2002) argue that identical treatment is not equal treatment and that a race-neutral or color-blind approach to biomedical research is neither equitable nor advantageous, and would not lead to a reduction of disparities in disease risk or treatment efficacies between groups. In our context, the use of protected attributes allows a rigorous evaluation of the potentially biased performance for different groups.

Clearly, our current proposal can be adopted only when data on protected attributes has been collected; generalizations of the proposed methodology to situations where the group identifier is unknown are topics for further research.

4.3. Conclusion and Future Work

We add to the tools that support fairness in data-driven recommendation systems by developing a highly operational method that can augment any prediction rule with the best available unbiased uncertainty estimates across groups. This is achieved by constructing prediction intervals that attain valid coverage regardless of the value of the sensitive attribute. The method is supported by rigorous coverage guarantees, as demonstrated on real-data examples. Although the focus of this article is on continuous response variables, one can adapt tools from conformal inference (Vovk et al., 2005) to construct prediction sets with equalized coverage for categorical target variables as well.

In this article, we have not discussed other measures of fairness: we believe an appropriate comparison would require much larger space, and would benefit from the inclusion of multiple voices. In evaluating the different proposals of the growing literature on algorithmic fairness, it might be useful to keep in mind a distinction between properties that should be required versus properties that are merely desirable. As an analogy, in statistical hypothesis testing, most commonly we require a bound on the false positive rate (Type I error); under this constraint, high power (low Type II error) is then desirable.

One century of statistical reasoning has taught us the importance of quantifying uncertainty and error. No algorithm should be ever deployed without a precise and intelligible description of the errors it makes and the statistical guarantees it offers. As practitioners know too well, it is most often not possible to guarantee that all errors are below a certain threshold. It becomes then crucial to select which statistical guarantee is most relevant for a problem, and fairness requires it to hold across different population groups. So, in the case of parole, we might think that the most crucial error to avoid is that of denying freedom to a deserving individual, and we should then enforce the probability of this error to be below the desired threshold in each population group. Or, as in the case of this article, we might want to provide the user with a 90% predictive interval for the GPA of a student, and we then need to require that its coverage is as advertised in each population. Equality in other measures of performance that have not been identified as primary (as the length of the predictive intervals) might then be desirable, but should not be prescribed and automatically pursued, without a conscious evaluation of the associated costs.

The knowledgeable reader will recognize that our approach is therefore different from the principle of equalized odds advocated in Hardt et al. (2016), which enforces that the two types of errors one can make in a binary classification problem must both be the same across the groups under study. (The cost is here that the algorithm would then need to change the predictions in at least one group to achieve the desired objective; this may be far from desirable and would not treat individuals equitably.) Returning to the distinction between a prescription and a wishlist, we make equalized coverage prescriptive. This does not mean that the data analyst cannot pay attention to other measures of fairness. For instance, she has the freedom to select predictive algorithms that score high on other metrics, for example, by adding empirical constraints to the construction of prediction sets (or intervals). We hope to report on progress in this direction in a future publication.

Disclosure Statement

E. C. was partially supported by the Office of Naval Research (ONR) under grant N00014-16-1-2712, by the Army Research Office (ARO) under grant W911NF-17-1-0304, by the Math + X award from the Simons Foundation and by a generous gift from TwoSigma. Y. R. was partially supported by the ARO grant and by the same Math + X award. Y. R. thanks the Zuckerman Institute, ISEF Foundation and the Viterbi Fellowship, Technion, for providing additional research support. R. F. B. was partially supported by the National Science Foundation via grant DMS-1654076 and by an Alfred P. Sloan fellowship. C. S. was partially supported by the National Science Foundation via grant DMS-1712800.


Quantile Neural Networks

We follow Koenker and Bassett (1978) and cast the estimation problem of the conditional quantiles of YX=xY | X {=} x as an optimization problem. Given training examples {(Xi,Yi):iI1}\{(X_i,Y_i) : i \in \mathcal{I}_1\}, we fit a parametric model using the pinball loss (Koenker and Bassett, 1978; Steinwart and Christmann, 2011), defined by

ρα(yy^):={α(yy^),if yy^>0,(α1)(yy^),otherwise\begin{aligned} \rho_\alpha(y - \hat y) := \begin{cases} \alpha (y - \hat y), & \text{if } y - \hat y > 0, \\ (\alpha-1) (y - \hat y) ,& \text{otherwise} \end{cases}\end{aligned}

where y^\hat y is the output of a regression function q^α(x)\hat q_\alpha(x) formulated as a deep neural network. The network design and training algorithm are identical to those described in Romano et al. (2019)6. Specifically, we use a two-hidden-layer neural network, with ReLU nonlinearities. The hidden dimension of both layers is set to 64. We use Adam optimizer Kingma and Ba (2014), with minibatches of size 6464 and a fixed learning rate of 5×1045 \times 10^{-4}. We employ weight decay regularization with parameter equal to 10610^{-6} and also use dropout (Srivastava et al., 2014) with a dropping rate of 0.10.1. We tune the number of epochs using cross validation (early stopping), with an upper limit of 1,0001,000 epochs.

Algorithm 1


Data (Xi,Ai,Yi)Rp×N×R, 1in(X_i, A_i, Y_i) \in {\mathbb R}^p \times {\mathbb N}\times {\mathbb R}, \ 1\leq i \leq n. Nominal coverage level 1α(0,1)1-\alpha \in (0,1). Quantile regression algorithm A\mathcal{A}. Training mode: joint/groupwise. Test point Xn+1=xX_{n+1}={x} with sensitive attribute An+1=aA_{n+1}=a.


Randomly split {1,,n}\left\lbrace 1,\dots,n \right\rbrace into two disjoint sets I1\mathcal{I}_1 and I2\mathcal{I}_2. If joint training: Fit quantile functions on the whole proper training set: {q^αlo,q^αhi}A({(Xi,Yi):iI1})\left\lbrace \hat{q}_{\alpha_{\text{\rm lo}}}, \hat{q}_{\alpha_{\text{\rm hi}}} \right\rbrace \leftarrow \mathcal{A}(\left\lbrace (X_i, Y_i): i \in \mathcal{I}_1 \right\rbrace). Else use groupwise training: Fit quantile functions on the proper training examples from group An+1=aA_{n+1}=a: {q^αlo,q^αhi}A({(Xi,Yi):iI1 and Ai=a})\left\lbrace \hat{q}_{\alpha_{\text{\rm lo}}}, \hat{q}_{\alpha_{\text{\rm hi}}} \right\rbrace \leftarrow \mathcal{A}(\left\lbrace (X_i, Y_i): i \in \mathcal{I}_1 \text{ and } A_i=a \right\rbrace). Compute EiE_i for each iI2(a)i \in \mathcal{I}_2(a), as in (3). Compute Q1α(E,I2(a))Q_{1-\alpha}(E, \mathcal{I}_2(a)), the (1α)(1+1/I2(a))(1-\alpha)(1+1/|\mathcal{I}_2(a)|)-th empirical quantile of {Ei:iI2(a)}\left\{E_i : i \in \mathcal{I}_2(a) \right\}.


Prediction interval C^(x,a)=[q^αlo(x)Q1α(E,I2(a)), q^αhi(x)+Q1α(E,I2(a))]\hat C(x,a) = \left[ \hat{q}_{\alpha_{\text{\rm lo}}}(x) - Q_{1-\alpha}(E, \mathcal{I}_2(a)) , \ \hat{q}_{\alpha_{\text{\rm hi}}}(x) + Q_{1-\alpha}(E, \mathcal{I}_2(a)) \right] for the unknown response Yn+1Y_{n+1}.

Table 1. Length and coverage of both marginal and group-conditional prediction intervals (α=0.1)(\alpha =0.1) constructed by conformal prediction (CP) and conformalized quantile regression (CQR) for Medical Expenditure Panel Survey (MEPS) data set.



Average Coverage

Average Length




*Marginal CP







Conditional CP (groupwise)







Conditional CP (joing)







*Marginal CQR







Conditional CQR (groupwise)







Conditional CQR (joint)




Note: The results are averaged across 40 random train-test (80%/20%) splits. Groupwise—two independent predictive models are used, one for non-White and another for White individuals; joint—the same predictive model is used for all individuals. In all cases, the model is formulated as a neural network. The methods marked by an asterisk are not supported by a group-conditional coverage guarantee.

Data Repository/Code

The code for reproducing all the experiments is available online at


Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2019). The limits of distribution-free conditional predictive inference. arXiv.

Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In S. A. Friedler & C. Wilson (Eds.), Proceedings of Machine Learning Research: Vol. 81. Proceedings of the 1st Conference on Fairness, Accountability and Transparency (pp. 77–91).

Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153–163.

Corbett-Davies, S. & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv.

Dieterich, W., Mendoza, C., & Brennan, T. (2016). Compas risk scales: Demonstrating accuracy equity and predictive parity. Northpoint Inc.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (pp. 214–226). Association for Computing Machinery.

Executive Office of the President (2014). Big Data: Seizing Opportunities, Preserving Values. Createspace Independent Pub.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., & Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 259–268). Association for Computing Machinery.

Gardner, J., Brooks, C., Andres, J. M., & Baker, R. S. (2018). MORF: A framework for predictive modeling and replication at scale with privacy-restricted mooc data. In 2018 IEEE International Conference on Big Data (pp. 3235–3244).

Gardner, J., Brooks, C., & Baker, R. (2019). Evaluating the fairness of predictive student models through slicing analysis. In Proceedings of the 9th International Conference on Learning Analytics and Knowledge (pp. 225–234). Association for Computing Machinery.

Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems, 29, 3315–3323. Curran Associates, Inc.

Hebert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the (computationally-identifiable) masses. In J. Dy & A. Krause (Eds.), Proceedings of Machine Learning Research: Vol. 80. Proceedings of the 35th International Conference on Machine Learning (pp. 1939-1948).

Kaback, M. M., O’Brien, J. S., & Rimoin, D. L. (1977). Tay-Sachs disease: Screening and prevention. Alan R. Liss.

Kessler, M. D., Yerges-Armstrong, L., Taub, M. A., Shetty, A. C., Maloney, K., Jeng, L. J. B., Ruczinski, I., Levin, A. M., Williams, L. K., Beaty, T. H., Mathias, R. A., Barnes, K. C., Consortium on Asthma among African-Ancestry Populations in Americas (CAAPA), & O’Connor, T. D. (2016). Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry. Nature Communications, 7(1), Article 12521.

Kim, M. P., Ghorbani, A., & Zou, J. (2019). Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 247–254). Association for Computing Machinery.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv.

Kivaranovic, D., Johnson, K. D., & Leeb, H. (2019). Adaptive, distribution-free prediction intervals for deep neural networks. arXiv.

Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In Leibniz International Proceedings in Informatics: Vol. 67. Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017) (pp. 43:1–43:23).

Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33–50.

Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol 30 (pp. 4066–4076). Curran Associates, Inc.

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111.

Lei, J., Robins, J., & Wasserman, L. (2013). Distribution-free prediction sets. Journal of the American Statistical Association, 108(501), 278–287.

Lei, J., & Wasserman, L. (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B, 76(1), 71–96.

Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.

Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). Inductve confidence machines for regression. In T. Elomaa, H. Mannila,& H. Toivonen (Eds.), Lecture notes in computer science: Vol. 2430. ECML 2002: Machine Learning: ECML 2002 (pp 345-356). Springer.

Risch, N., Burchard, E., Ziv, E., & Tang, H. (2002). Categorization of humans in biomedical research: Genes, race and disease. Genome Biology, 3(7), comment2007.

Romano, Y., Patterson, E., & Candès, E. (2019). Conformalized quantile regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol 32 (pp. 3543–3553). Curran Associates, Inc.

Rudin, C., Wang, C., & Coker, B. (2020). The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1).

Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371–421.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56), 1929–1958.

Steinwart, I., & Christmann, A. (2011). Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1), 211–225.

Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 1231–1264.

Taylor, J. W. (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4), 299–311.<299::AID-FOR775>3.0.CO;2-V

Tibshirani, R. J., Foygel Barber, R., Candes, E., & Ramdas, A. (2019). Conformal prediction under covariate shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol 32 (pp. 2530–2540). Curran Associates, Inc.

Vovk, V. (2012). Conditional validity of inductive conformal predictors. In S. C. H. Hoi, & W. Buntine (Eds.), Proceedings of Machine Learning Research: Vol. 25. Proceedings of the Asian Conference on Machine Learning (pp. 475–490).

Vovk, V., Gammerman, A., & Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the 16th International Conference on Machine Learning (pp. 444–453). Morgan Kaufmann Publishers Inc.

Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. Springer.

Vovk, V., Nouretdinov, I., & Gammerman, A. (2009). On-line predictive linear regression. Annals of Statistics, 37(3), 1566–1590.

Zafar, M. B., Valera, I., Gomez Rodriguez, M., & Gummadi, K. P. (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web (pp. 1171–1180).

Zafar, M. B., Valera, I., Gomez-Rodriguez, M., & Gummadi, K. P. (2019). Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(1), 2737–2778.

Zhou, K. Q., & Portnoy, S. L. (1996). Direct use of regression quantiles to construct confidence sets in linear models. Annals of Statistics, 24(1), 287–306.

Zhou, K. Q., & Portnoy, S. L. (1998). Statistical inference on heteroscedastic models based on regression quantiles. Journal of Nonparametric Statistics, 9(3), 239–260.

©2020 Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel J. Candès. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.

No comments here
Why not start the discussion?