An important factor in guaranteeing a fair use of data-driven recommendation systems is the ability to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology, equalized coverage, is flexible, as it can be viewed as a wrapper around any predictive algorithm. We test the applicability of the proposed framework on real data, demonstrating that equalized coverage constructs unbiased prediction intervals, unlike competing methods.
Keywords: conformal prediction, calibration, uncertainty quantification, fairness, quantile regression, unbiased predictions.
Machine learning algorithms are increasingly deployed in sensitive applications to inform the selection of job candidates, to inform bail and parole decisions, and to filter loan applications, among many others. Such practices have become the subject of intense scrutiny, as society must be concerned about whether these algorithms reinforce discrimination and make the status quo normative. A major area of study has therefore been to propose mathematical definitions of appropriate notions of fairness, or algorithmic models of fairness, and to make sure that learned models comply with such prescriptions. Because fairness is rather ill defined, it has been reported that such notions can be incompatible with one another or can hurt the very groups they intend to protect.
In this work, we follow the prescription of Corbett-Davies and Goel (2018) and decouple the statistical problem of risk assessment from the policy problem of taking actions or designing interventions. Rather than dictating policy, our aim will solely be the design of statistical algorithms providing the decision maker with information summarizing the knowledge that can be extracted from state-of-the-art machine learning systems in a way that is mathematically guaranteed to be unbiased regardless of a person’s protected attributes, providing an operational definition of fairness. This is accomplished by constructing prediction sets, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology is flexible in the sense that it can be wrapped around any predictive algorithm. For instance, in a stylized application where a recommendation system for college admission predicts the GPA of candidate students after 2 years of undergraduate education, our methodology would modify this system to produce, for each student, a range of values obeying two properties. First, the range contains the true outcome 90% of the time (or any other percentage). Second, this property holds regardless of the group to which the student belongs.
We are increasingly turning to machine learning systems to support human decisions. While decision makers may be subject to many forms of prejudice and bias, the promise and hope is that machines would be able to make more equitable decisions. Unfortunately, whether because they are fitted on already biased data or for other reasons, there are concerns that some of these data-driven recommendation systems treat members of different classes differently, perpetuating biases, providing different degrees of utility, and inducing disparities. The examples that have emerged are quite varied:
Criminal justice: Courts in the United States may use COMPAS—a commercially available algorithm to assess a criminal defendant’s likelihood of becoming a recidivist—to help them decide who should receive parole, based on records collected through the criminal justice system. In 2016 ProPublica analyzed COMPAS and found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk (Dieterich et al., 2016).1, 2
Recognition system: A department of motor vehicles (DMV) may use facial recognition tools to detect people with false identities, by comparing driver’s license or ID photos with other DMV images on file. In a related context, Buolamwini and Gebru (2018) evaluated the performance of three commercial classification systems that employ facial images to predict individuals’ gender, and reported that the overall classification accuracy on male individuals was higher than on female individuals. They also found that the predictive performance on lighter-skinned individuals was higher than on darker-skinned individuals.
College admissions: A college admission office may be interested in a new algorithm for predicting the college GPA of a candidate student at the end of their sophomore year, by using features such as high school GPA, SAT scores, AP courses taken and scores, intended major, levels of physical activity, and so on. On a similar matter, the work reported in Gardner et al. (2019) studied various data-driven algorithms that aim to predict whether a student will drop out from a massive open online course (MOOC). Using a large data set available from Gardner et al. (2018), the authors found that in some cases there are noticeable differences between the models’ predictive performance on male students compared to female students.
Disease risk: Health care providers may be interested in predicting the chance that an individual develops certain disorders. Diseases with a genetic component have different frequencies in different human populations, reflecting the fact that disease-causing mutations arose at different times and in individuals residing in different areas: for example, Tay-Sachs disease is approximately 100 times more common in infants of Ashkenazi Jewish ancestry (Central-Eastern Europe) than in non-Jewish infants (Kaback et al., 1977). The genotyping of DNA polymorphisms can lead to more precise individual risk assessment than that derived from simply knowing to which ethnic group the individual belongs. However, given our still partial knowledge of the disease-causing mutations and their prevalence in different populations, the precision of these estimates varies substantially across ethnic groups. For instance, the study reported in Kessler et al. (2016) found a preference for European genetic variants over non-European variants in two genomic databases that are widely used by clinical geneticists (this reflects the fact that most studies have been conducted on European populations). Relying only on this information would result in predictions that are more accurate for individuals of European descent than for others.
The breadth of these examples underscores how data must be interpreted with care; the method that is advocated in this article is useful regardless of whether the disparity is due to factors of inequality or bias, or instead due to genetic risk. Indeed, policymakers have issued a call to action (Executive Office of the President, 2014):
we must uphold our fundamental values so these systems are neither destructive nor opportunity limiting. [...] In order to ensure that growth in the use of data analytics is matched with equal innovation to protect the rights of Americans, it will be important to support research into mitigating algorithmic discrimination, building systems that support fairness and accountability, and developing strong data ethics frameworks.
This is a broad call that covers multiple aspects of data collection, mining, and interpretation; clearly, a response requires a multifaceted approach. Encouragingly, the machine learning community is beginning to respond to this challenge. A major area of study has been to propose mathematical definitions of appropriate notions of fairness (Dwork et al., 2012; Hardt et al., 2016; Dieterich et al., 2016; Zafar et al., 2017; Hebert-Johnson et al., 2018; Kim et al., 2019) or algorithmic models of fairness (Kusner et al., 2017). In many cases, these definitions are an attempt to articulate in mathematical terms what it means not to discriminate on the basis of protected characteristics; U.S. law identifies these as sex, race, age, disability, color, creed, national origin, religion, and genetic information. Now, discrimination can take many forms, and it is not surprising that it might be difficult to identify one analytical property that detects it in every context. Moreover, the call above is broader than the specific domains where discrimination is forbidden by law and invites us to develop analytical frameworks that guarantee an ethical use of data.
To begin to formalize the problem, it is useful to consider the task of predicting the value of $Y \in \{0, 1\}$, a binary variable, with a guess $\hat{Y}$. We assume that $Y = 1$ represents a more favorable state, and that the value of $\hat{Y}$ will influence the decider, so that predicting $\hat{Y} = 1$ for some individuals gives them an advantage. In this context, $\mathbb{P}\{\hat{Y} = 0 \mid Y = 1\}$, the false negative rate, represents the probability with which an opportunity is denied to a well-deserving individual. It is obvious that this is a critical error rate to control in scenarios such as deciding parole (see Example 1): freedom is a fundamental right, and nobody should be deprived of it needlessly. We then wish to require that $\mathbb{P}\{\hat{Y} = 0 \mid Y = 1, A = a\}$ is equal across all values $a$ of the protected attribute $A$ (Hardt et al., 2016). In the case of distributions of goods (as when giving a loan), one might argue for parity of other measures such as $\mathbb{P}\{\hat{Y} = 1 \mid A = a\}$, which would guarantee that resources are distributed equally across the different population categories (Dwork et al., 2012; Feldman et al., 2015; Zafar et al., 2019). Indeed, these observations are at the basis of two notions of fairness considered in the literature.
Researchers have noted several problems with fairness measures that ask for (approximate) parity of some statistical measure across all of these groups. Without providing a complete discussion, we list some of these problems here. (a) To begin with, it is usually unclear how to design algorithms that would actually obey these notions of fairness from finite samples, especially in situations where the outcome of interest or protected attribute is continuous. (b) Even if we could somehow operationalize the fairness program, these measures are usually incompatible: it is provably impossible to design an algorithm that obeys all notions of fairness simultaneously (Chouldechova, 2017; Kleinberg et al., 2017). (c) The appropriate measure of fairness appears to be context dependent. Consider Example 4 and suppose that $Y = 1$ corresponds to having Tay-Sachs, whose rate differs across populations. Due to the unbalanced nature of the disease, one would expect the predictive model to have a lower true positive rate for non-Jewish infants than for Ashkenazi Jewish infants (for whom the disease is much more common). Here, forcing parity of true positive rates (Hardt et al., 2016) would conflict with accurate predictions for each group (Hebert-Johnson et al., 2018). (d) Finally, and perhaps most importantly, researchers have argued that enforcing frequently discussed fairness criteria can often harm the very groups that these measures were designed to protect (Corbett-Davies and Goel, 2018).
In light of this, it has been suggested to decouple the statistical problem of risk assessment from the policy problem of taking actions and designing interventions. Quoting from Corbett-Davies and Goel (2018), “an algorithm might (correctly) infer that a defendant has a 20% chance of committing a violent crime if released, but that fact does not, in and of itself, determine a course of action.” Keeping away from policy then, how can we respond to the call in Executive Office of the President (2014) and provide a policymaker the best information gleaned from data while supporting equitable treatment? Our belief is that multiple approaches will be needed, and with this short article our aim is to introduce an additional tool to evaluate the performance of algorithms across different population groups.
One fundamental way to support data ethics is not to overstate the power of algorithms and data-based predictions, but rather to always accompany these with measures of uncertainty that are easily understandable by the user. This can be done, for example, by providing a plausible range of predicted values for the outcome of interest. For instance, consider a recommendation system for college admission (Example 3). Not knowing the accuracy of the prediction algorithm, we would like to produce, for each student, a predicted GPA interval obeying the following two properties: first, the interval should be faithful in the sense that the true unknown outcome lies within the predicted range 90% of the time, say; second, it should be unbiased in that the average coverage should be the same within each group.
Such a prediction interval has the virtue of informing the decision maker about the evidence machine learning can provide while being explicit about the limits of predictive performance. If the interval is long, it just means that the predictive model can say little. Each group enjoys identical treatment, receiving equal coverage (e.g., 90%, or any level the decision maker wishes to achieve). Hence, the results of data analysis are unbiased to all. In particular, if the larger sample size available for one group overly influences the fit, leading to poor performance in the other groups, the prediction intervals will make this immediately apparent through much wider bands for the groups with fewer samples. Prediction intervals with equalized coverage, then, naturally assess and communicate the fact that an algorithm has varied levels of performance on different subgroups.
It seems impossible a priori to present information to the policymaker in such a compelling fashion without a strong model for the dependence of the response $Y$ on the features $X$ and protected attribute $A$. In our college admission example, one may have trained a wide array of complicated predictive algorithms such as random forests or deep neural networks, each with its own suite of parameters; for all practical purposes, the fitting procedure may just as well be a black box. The surprise is that such a feat is possible under no assumption other than that of having samples drawn exchangeably—for example, they may be drawn i.i.d.—from a population of interest. We propose a concrete procedure, which acts as a wrapper around the predictive model, to produce valid prediction intervals that provably satisfy the equalized coverage constraint for any black-box algorithm, sample size, and distribution. Such a procedure can be formulated by refining tools from conformal inference, a general methodology for constructing prediction intervals (Vovk et al., 1999; Papadopoulos et al., 2002; Vovk et al., 2005, 2009; Lei et al., 2013, 2018; Romano et al., 2019). Our contribution extends classical conformal inference as we seek a form of conditional rather than marginal coverage guarantee (Vovk, 2012; Lei and Wasserman, 2014; Barber et al., 2019).
The specific procedure we suggest to construct predictive intervals with equal coverage, then, supports equitable treatment in an additional dimension. Specifically, we use the same learning algorithm for all individuals, borrowing strength from the entire population, and leveraging the entire data set, while adjusting global predictions to make local confidence statements valid for each group. Such a training strategy may also improve the statistical efficiency of the predictive model, as illustrated by our experiments in Section 3. Of course, our approach comes with limitations as well: we discuss these and possible extensions in Section 4.
Let $(X_i, A_i, Y_i)$, $i = 1, \ldots, n$, be some training data, where the feature vector $X_i$ may contain the sensitive attribute $A_i \in \mathcal{A}$ as one of its entries. Consider a test point with known $X_{n+1}$ and $A_{n+1}$; we aim to construct a prediction interval $C(X_{n+1}, A_{n+1}) \subseteq \mathbb{R}$ which contains the unknown response $Y_{n+1}$ with probability at least $1 - \alpha$ on average within each group; here, $1 - \alpha$ is a desired coverage level. Our ideas extend to categorical responses in a fairly straightforward fashion; for brevity, we do not consider these extensions in this article. Interested readers will find details on how to build valid conformal prediction sets in classification problems in Shafer and Vovk (2008). Formally, we assume that the training and test samples $(X_i, A_i, Y_i)$, $i = 1, \ldots, n+1$, are drawn exchangeably from some arbitrary and unknown distribution $P$, and we wish that our prediction interval obeys the following property:

$$\mathbb{P}\{Y_{n+1} \in C(X_{n+1}, A_{n+1}) \mid A_{n+1} = a\} \geq 1 - \alpha \qquad (1)$$

for all $a \in \mathcal{A}$, where the probability is taken over the $n$ training samples and the test case. Once more, (1) must hold for any distribution $P$, sample size $n$, and regardless of the group identifier $a$. (While this only ensures that coverage is at least $1 - \alpha$ for each group—and, therefore, the groups may have unequal coverage levels—we will see that under mild conditions the coverage can also be upper bounded to lie very close to the target level $1 - \alpha$.)
In this section we present a methodology to achieve (1). Our solution builds on classical conformal prediction (Vovk et al., 2005; Lei et al., 2018) and the recent conformalized quantile regression (CQR) approach (Romano et al., 2019) originally designed to construct marginal distribution-free prediction intervals (see also Kivaranovic et al., 2019). CQR combines the rigorous coverage guarantee of conformal prediction with the statistical efficiency of quantile regression (Koenker and Bassett, 1978) and has been shown to be adaptive to the local variability of the data distribution under study. Below, we present a modification of CQR obeying (1). Then in Section 2.2, we draw connections to conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) and explain how classical conformal inference can also be used to construct prediction intervals with equal coverage across protected groups.3
Before describing the proposed method we introduce a key result in conformal prediction, adapted to our conditional setting. Variants of the following lemma appear in the literature (Vovk, 2012; Vovk et al., 2005; Lei et al., 2018; Tibshirani et al., 2019; Romano et al., 2019).
Lemma 1. Suppose the random variables $V_1, \ldots, V_{m+1}$ are exchangeable conditional on the event $\{A_1 = \cdots = A_{m+1} = a\}$, and define $\hat{Q}_{1-\alpha}$ to be the $\lceil (m+1)(1-\alpha) \rceil$-th smallest value among $V_1, \ldots, V_m$, i.e., the $(1-\alpha)(1 + 1/m)$-th empirical quantile of $V_1, \ldots, V_m$.

For any $\alpha \in (0, 1)$,

$$\mathbb{P}\{V_{m+1} \leq \hat{Q}_{1-\alpha} \mid A_1 = \cdots = A_{m+1} = a\} \geq 1 - \alpha.$$

Moreover, if the random variables $V_1, \ldots, V_{m+1}$ are almost surely distinct, then it also holds that

$$\mathbb{P}\{V_{m+1} \leq \hat{Q}_{1-\alpha} \mid A_1 = \cdots = A_{m+1} = a\} \leq 1 - \alpha + \frac{1}{m+1}.$$
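To make the lemma concrete, here is a small numerical check (a sketch in Python; the helper name is ours, and i.i.d. Gaussian scores stand in for generic exchangeable ones):

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((m+1)(1-alpha))-th smallest of the m calibration
    scores, i.e., the (1-alpha)(1+1/m)-th empirical quantile of Lemma 1."""
    m = len(scores)
    k = int(np.ceil((m + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, m) - 1]

rng = np.random.default_rng(0)
alpha, m, trials = 0.1, 99, 20000
covered = 0
for _ in range(trials):
    v = rng.normal(size=m + 1)  # i.i.d. (hence exchangeable) scores
    covered += v[-1] <= conformal_quantile(v[:-1], alpha)
print(covered / trials)  # close to 1 - alpha = 0.9
```

With $m = 99$ and continuous scores, the coverage is exactly $\lceil 100 \cdot 0.9 \rceil / 100 = 0.9$, illustrating both the lower and upper bounds of the lemma.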
Our method starts by randomly splitting the $n$ training points into two disjoint subsets: a proper training set $\mathcal{I}_1$ and a calibration set $\mathcal{I}_2$. Then, consider any algorithm $\mathcal{Q}$ for quantile regression that estimates conditional quantile functions from observational data, such as quantile neural networks (Taylor, 2000) (described in the Appendix). To construct a prediction interval with $1 - \alpha$ coverage, fit two conditional quantile functions on the proper training set,

$$\hat{q}_{\alpha_{lo}}, \hat{q}_{\alpha_{hi}} \leftarrow \mathcal{Q}\big(\{(X_i, A_i, Y_i) : i \in \mathcal{I}_1\}\big), \qquad (2)$$

at levels $\alpha_{lo} = \alpha/2$ and $\alpha_{hi} = 1 - \alpha/2$, say, and form a first estimate of the prediction interval at $x$: $\hat{C}(x) = [\hat{q}_{\alpha_{lo}}(x), \hat{q}_{\alpha_{hi}}(x)]$. $\hat{C}(x)$ is constructed with the goal that a new case with covariates $x$ should have probability $1 - \alpha$ of its response lying in the interval, but the interval was empirically shown to under- or over-cover the test target variable (Romano et al., 2019). (Quantile regression algorithms are not supported by finite-sample coverage guarantees [Steinwart and Christmann, 2011; Takeuchi et al., 2006; Meinshausen, 2006; Zhou and Portnoy, 1996, 1998].)
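As an illustration, the split-and-fit step can be sketched as follows. The experiments in the paper use quantile neural networks; here we substitute scikit-learn's gradient boosting quantile regression on synthetic data purely for brevity:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(0.0, 5.0, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3 * (1.0 + X[:, 0]))  # heteroscedastic noise

# Randomly split into a proper training set I1 and a calibration set I2.
perm = rng.permutation(n)
I1, I2 = perm[: n // 2], perm[n // 2 :]

alpha = 0.1
q_lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X[I1], y[I1])
q_hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X[I1], y[I1])

# First (uncalibrated) interval estimate C_hat(x) = [q_lo(x), q_hi(x)].
x_new = np.array([[2.5]])
print(q_lo.predict(x_new)[0], q_hi.predict(x_new)[0])
```

Any quantile regression method can be dropped in here; the calibration step that follows does not depend on this choice.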
This motivates the next step, which borrows ideas from split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) and CQR (Romano et al., 2019). Consider a group $a \in \mathcal{A}$, and compute the empirical errors (often called conformity scores) achieved by the first guess $\hat{C}$. This is done by extracting the calibration points that belong to that group, $\mathcal{I}_2(a) = \{i \in \mathcal{I}_2 : A_i = a\}$, and computing

$$E_i = \max\{\hat{q}_{\alpha_{lo}}(X_i) - Y_i,\ Y_i - \hat{q}_{\alpha_{hi}}(X_i)\}, \qquad i \in \mathcal{I}_2(a). \qquad (3)$$

This step provides a family of conformity scores that are restricted to the group $a$. Each score $E_i$ measures the signed distance of the target variable $Y_i$ to the boundary of the interval $[\hat{q}_{\alpha_{lo}}(X_i), \hat{q}_{\alpha_{hi}}(X_i)]$; if $Y_i$ is located outside the initial interval, then $E_i > 0$ and is equal to the distance to the closest interval endpoint. If $Y_i$ lies inside the interval, then $E_i \leq 0$ and its magnitude also equals the distance to the closest endpoint. As we shall see immediately below, these scores may serve to measure the quality of the initial guess and can be used to calibrate it so as to obtain the desired distribution-free coverage. Crucially, our approach makes no assumptions on the form or the properties of $\hat{C}$—it may come from any model class, and is not required to meet any particular level of accuracy or coverage. Its role is to provide a base algorithm that effectively estimates the underlying uncertainty, around which we will build our predictive intervals.
Finally, the following crucial step builds a prediction interval for the unknown $Y_{n+1}$ given $X_{n+1}$ and $A_{n+1} = a$. This is done by computing

$$Q_{1-\alpha}(E, \mathcal{I}_2(a)) := (1-\alpha)\left(1 + 1/|\mathcal{I}_2(a)|\right)\text{-th empirical quantile of } \{E_i : i \in \mathcal{I}_2(a)\},$$

which is then used to calibrate the first interval estimate as follows:

$$C(X_{n+1}, a) = \left[\hat{q}_{\alpha_{lo}}(X_{n+1}) - Q_{1-\alpha}(E, \mathcal{I}_2(a)),\ \hat{q}_{\alpha_{hi}}(X_{n+1}) + Q_{1-\alpha}(E, \mathcal{I}_2(a))\right]. \qquad (4)$$
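Putting the pieces together, the group-conditional calibration step can be sketched as below (the function name and toy data are ours):

```python
import numpy as np

def cqr_group_interval(q_lo_cal, q_hi_cal, y_cal, groups_cal,
                       q_lo_test, q_hi_test, group_test, alpha=0.1):
    """Calibrate the initial interval within the test point's group:
    conformity scores E_i = max(q_lo - y, y - q_hi) over I2(a), then
    shift both endpoints by their conformal quantile Q."""
    mask = np.asarray(groups_cal) == group_test  # restrict to I2(a)
    scores = np.maximum(q_lo_cal[mask] - y_cal[mask],
                        y_cal[mask] - q_hi_cal[mask])
    m = len(scores)
    k = int(np.ceil((m + 1) * (1 - alpha)))
    Q = np.sort(scores)[min(k, m) - 1]
    return q_lo_test - Q, q_hi_test + Q

# Toy calibration set, one group: initial interval [0, 1] everywhere,
# responses strictly inside it, so all scores are negative.
y_cal = np.arange(1, 10) / 10
lo, hi = cqr_group_interval(np.zeros(9), np.ones(9), y_cal,
                            np.zeros(9, dtype=int), 0.0, 1.0, 0, alpha=0.1)
print(lo, hi)  # the calibrated interval shrinks
```

Note that $Q$ may be negative, in which case the calibrated interval is shorter than the initial one.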
Before proving the validity of the interval in (4), we pause to present two possible training strategies for the initial quantile regression interval $\hat{C}$. We refer to the first as joint training, as it uses the whole proper training set to learn a single predictive model; see (2). The second approach, which we call groupwise training, constructs a prediction interval separately for each group; that is, for each value $a$, we fit a regression model to all training examples with $A_i = a$. These two variants of the CQR procedure are summarized in Algorithm 1. While the statistical efficiency of the two approaches can differ (as we show in Section 3), both are guaranteed to attain valid group-conditional coverage for any data distribution and regardless of the choice or accuracy of the quantile regression estimate.
Theorem 1. If the samples $(X_i, A_i, Y_i)$, $i = 1, \ldots, n+1$, are exchangeable, then the prediction interval $C(X_{n+1}, A_{n+1})$ constructed by Algorithm 1 obeys

$$\mathbb{P}\{Y_{n+1} \in C(X_{n+1}, A_{n+1}) \mid A_{n+1} = a\} \geq 1 - \alpha$$

for each group $a$. Moreover, if the conformity scores $E_i$, $i \in \mathcal{I}_2(a)$, are almost surely distinct, then the group-conditional prediction interval is nearly perfectly calibrated:

$$\mathbb{P}\{Y_{n+1} \in C(X_{n+1}, A_{n+1}) \mid A_{n+1} = a\} \leq 1 - \alpha + \frac{1}{|\mathcal{I}_2(a)| + 1}$$

for each group $a$.
Proof. Fix any group $a$. Since our calibration samples are exchangeable, the conformity scores (3) are also exchangeable. Exchangeability also holds when we add the test score $E_{n+1} = \max\{\hat{q}_{\alpha_{lo}}(X_{n+1}) - Y_{n+1},\ Y_{n+1} - \hat{q}_{\alpha_{hi}}(X_{n+1})\}$ to this list. Consequently, by Lemma 1,

$$1 - \alpha \leq \mathbb{P}\{E_{n+1} \leq Q_{1-\alpha}(E, \mathcal{I}_2(a)) \mid A_{n+1} = a\} \leq 1 - \alpha + \frac{1}{|\mathcal{I}_2(a)| + 1}, \qquad (5)$$

where the upper bound holds under the additional assumption that the scores are almost surely distinct, while the lower bound holds without this assumption.

To prove the validity of $C(X_{n+1}, A_{n+1})$ conditional on $A_{n+1} = a$, observe that, by definition,

$$Y_{n+1} \in C(X_{n+1}, a) \iff E_{n+1} \leq Q_{1-\alpha}(E, \mathcal{I}_2(a)).$$

Hence, the result follows from (5).
Variant: Asymmetric group-conditional CQR. When the distribution of the conformity scores is highly skewed, the coverage error may spread asymmetrically over the left and right tails. In some applications it may be better to consider a variant of $C(X_{n+1}, a)$ that controls the coverage of the two tails separately, leading to a stronger conditional coverage guarantee. To achieve this goal, we follow the approach from Romano et al. (2019) and evaluate two separate empirical quantile functions: one for the left tail,

$$Q_{1-\alpha_{lo}}(E_{lo}, \mathcal{I}_2(a)), \quad \text{with scores } E_{lo,i} = \hat{q}_{\alpha_{lo}}(X_i) - Y_i, \ i \in \mathcal{I}_2(a),$$

and a second for the right tail,

$$Q_{1-\alpha_{hi}}(E_{hi}, \mathcal{I}_2(a)), \quad \text{with scores } E_{hi,i} = Y_i - \hat{q}_{\alpha_{hi}}(X_i), \ i \in \mathcal{I}_2(a).$$

Next, we set $\alpha_{lo} + \alpha_{hi} = \alpha$ and construct the interval for $Y_{n+1}$ given $X_{n+1}$ and $A_{n+1} = a$:

$$C(X_{n+1}, a) = \left[\hat{q}_{\alpha_{lo}}(X_{n+1}) - Q_{1-\alpha_{lo}}(E_{lo}, \mathcal{I}_2(a)),\ \hat{q}_{\alpha_{hi}}(X_{n+1}) + Q_{1-\alpha_{hi}}(E_{hi}, \mathcal{I}_2(a))\right].$$
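A sketch of this two-tailed calibration, with $\alpha_{lo} = \alpha_{hi} = \alpha/2$ (function name and toy data ours; scores assumed already restricted to one group):

```python
import numpy as np

def cqr_asymmetric(q_lo_cal, q_hi_cal, y_cal, q_lo_test, q_hi_test, alpha=0.1):
    """Calibrate the two tails separately: left scores q_lo - y and right
    scores y - q_hi, each at level 1 - alpha/2."""
    m = len(y_cal)
    k = int(np.ceil((m + 1) * (1 - alpha / 2)))
    Q_lo = np.sort(q_lo_cal - y_cal)[min(k, m) - 1]  # left-tail correction
    Q_hi = np.sort(y_cal - q_hi_cal)[min(k, m) - 1]  # right-tail correction
    return q_lo_test - Q_lo, q_hi_test + Q_hi

# Toy calibration set: initial interval [0, 1], responses strictly inside it.
y_cal = np.arange(1, 10) / 10
lo, hi = cqr_asymmetric(np.zeros(9), np.ones(9), y_cal, 0.0, 1.0)
print(lo, hi)
```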
The validity of this procedure is stated below.
Theorem 2. Suppose the samples $(X_i, A_i, Y_i)$, $i = 1, \ldots, n+1$, are exchangeable. With the notation above, put $Q_{lo} = Q_{1-\alpha_{lo}}(E_{lo}, \mathcal{I}_2(a))$ and $Q_{hi} = Q_{1-\alpha_{hi}}(E_{hi}, \mathcal{I}_2(a))$ for short. Then

$$1 - \alpha_{lo} \leq \mathbb{P}\{Y_{n+1} \geq \hat{q}_{\alpha_{lo}}(X_{n+1}) - Q_{lo} \mid A_{n+1} = a\} \leq 1 - \alpha_{lo} + \frac{1}{|\mathcal{I}_2(a)| + 1}$$

and

$$1 - \alpha_{hi} \leq \mathbb{P}\{Y_{n+1} \leq \hat{q}_{\alpha_{hi}}(X_{n+1}) + Q_{hi} \mid A_{n+1} = a\} \leq 1 - \alpha_{hi} + \frac{1}{|\mathcal{I}_2(a)| + 1},$$

where the lower bounds above always hold, while the upper bounds hold under the additional assumption that the residuals are almost surely distinct. Under these circumstances, the asymmetric interval defined above obeys

$$\mathbb{P}\{Y_{n+1} \in C(X_{n+1}, A_{n+1}) \mid A_{n+1} = a\} \geq 1 - \alpha_{lo} - \alpha_{hi} = 1 - \alpha.$$
Proof. As in the proof of Theorem 1, the validity of the lower and upper bounds is obtained by applying Lemma 1 twice.
The difference between CQR (Romano et al., 2019) and split conformal prediction (Papadopoulos et al., 2002) is that the former calibrates an estimated quantile regression interval $[\hat{q}_{\alpha_{lo}}(x), \hat{q}_{\alpha_{hi}}(x)]$, while the latter builds a prediction interval around an estimate $\hat{\mu}(x)$ of the conditional mean. For instance, $\hat{\mu}$ can be formulated as a classical regression function estimate, obtained by minimizing the mean-squared-error loss over the proper training examples. To construct predictive intervals for the group $a$, then, simply replace both $\hat{q}_{\alpha_{lo}}$ and $\hat{q}_{\alpha_{hi}}$ with $\hat{\mu}$ in (4) (or in its two-tailed variant). The theorems go through, and this procedure gives predictive intervals with exactly the same guarantees as before. As we will see in our empirical results, a benefit of explicitly modeling quantiles is superior statistical efficiency.
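In code, the split conformal variant amounts to swapping the score: with both initial quantile estimates replaced by a mean estimate, the CQR score $\max\{\hat{\mu}(X_i) - Y_i, Y_i - \hat{\mu}(X_i)\}$ reduces to the absolute residual (a minimal sketch; the function name and toy data are ours):

```python
import numpy as np

def split_conformal_interval(mu_cal, y_cal, mu_test, alpha=0.1):
    """Split conformal prediction around a mean estimate mu: absolute
    residuals as scores, symmetric band mu +/- Q around the prediction."""
    scores = np.abs(y_cal - mu_cal)
    m = len(scores)
    k = int(np.ceil((m + 1) * (1 - alpha)))
    Q = np.sort(scores)[min(k, m) - 1]
    return mu_test - Q, mu_test + Q

# Toy calibration set: predictions all 0, responses 0.1, ..., 0.9.
lo, hi = split_conformal_interval(np.zeros(9), np.arange(1, 10) / 10, 0.0)
print(lo, hi)  # symmetric band around the point prediction
```

Unlike CQR, the resulting interval has the same length everywhere, which is why explicitly modeling quantiles tends to be more statistically efficient when the noise level varies with the features.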
The Medical Expenditure Panel Survey (MEPS) 2016 data set,4 provided by the Agency for Healthcare Research and Quality, contains information on individuals and their utilization of medical services. The features used for modeling include age, marital status, race, poverty status, functional limitations, health status, health insurance type, and more. We split these features into dummy variables to encode each category separately. The goal is to predict the health care system utilization of each individual, a score that reflects the number of visits to a doctor’s office, hospital visits, and so on. After removing observations with missing entries, we set the sensitive attribute to race, with $A = 0$ for non-White and $A = 1$ for White individuals. In all experiments we transform the response variable by $y \mapsto \log(1 + y)$, as the raw score is highly skewed.
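The preprocessing just described can be sketched as follows; the column names below are hypothetical stand-ins (the actual MEPS variable names differ), and only the steps matter: dropping missing rows, building the race indicator, log-transforming the skewed score, and one-hot encoding the categoricals.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the MEPS table; real column names differ.
df = pd.DataFrame({
    "age": [33, 51, 27],
    "race": ["white", "nonwhite", "nonwhite"],
    "insurance": ["private", "public", "none"],
    "utilization": [0, 12, 3],
}).dropna()

A = (df["race"] == "white").astype(int)              # sensitive attribute A
y = np.log(1.0 + df["utilization"])                  # tame the skewed score
X = pd.get_dummies(df.drop(columns="utilization"))   # one-hot categoricals
print(X.shape, float(y.iloc[0]))
```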
Below, we illustrate that empirical quantiles can be used to detect prediction bias. Next, we show that usual (marginal) conformal methods do not attain equal coverage across the two groups. Finally, we compare the performance of joint vs. groupwise model fitting and show that, in this example, the former yields shorter predictive intervals.
We randomly split the data into training (80%) and test (20%) sets and standardize the features to have zero mean and unit variance; the means and variances are computed using the training examples. Then we fit a neural network regression function $\hat{\mu}$ on the training set, where the network architecture, optimization, and hyperparameters are similar to those described and implemented in Romano et al. (2019).5 Next, we compute the signed residuals of the test samples,

$$R_i = Y_i - \hat{\mu}(X_i),$$

and plot the resulting empirical cumulative distribution functions $\hat{F}_0$ (non-White) and $\hat{F}_1$ (White) in Figure 1. Observe that the two distributions differ. In particular, when comparing the two functions at $t = 0$, we see that $\hat{\mu}$ overestimates the response of the non-White group and underestimates the response of the White group, as $\hat{F}_0(0) > \hat{F}_1(0)$.
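This residual-ECDF diagnostic can be sketched with simulated residuals; the location shifts below are hypothetical, chosen only to reproduce the qualitative pattern in which one group's responses are overestimated more often than the other's:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated test-set residuals R_i = Y_i - mu_hat(X_i), with hypothetical
# shifts mimicking a fit that overestimates group 0 on average.
r0 = rng.normal(loc=-0.2, size=500)   # group A = 0
r1 = rng.normal(loc=+0.2, size=500)   # group A = 1

def ecdf(sample, t):
    """Empirical CDF of `sample` at t: fraction of values <= t."""
    return float(np.mean(sample <= t))

# F0(0) > F1(0): negative residuals are more frequent in group 0,
# i.e., the model overestimates that group's responses more often.
print(ecdf(r0, 0.0), ecdf(r1, 0.0))
```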
Recall that the lower and upper quantiles of the signed residuals are used to construct valid group-conditional prediction intervals. While these must be evaluated on calibration examples (see next section), for illustrative purposes we present below the 0.05 and 0.95 empirical quantiles of each group using the two cumulative distribution functions of test residuals. To this end, we denote by $\hat{t}^{(0)}_{0.05}$ and $\hat{t}^{(1)}_{0.05}$ the lower empirical quantiles of the non-White and White groups, defined to be the smallest numbers obeying

$$\hat{F}_0(\hat{t}^{(0)}_{0.05}) \geq 0.05 \quad \text{and} \quad \hat{F}_1(\hat{t}^{(1)}_{0.05}) \geq 0.05.$$
Following Figure 1, the lower quantile of the White group is more negative than that of the non-White group, implying that for at least 5% of the test samples of each group, the fitted regression function overestimates the utilization of medical services, with larger errors for White individuals than for non-White individuals.
As for the upper empirical quantiles, we compute the smallest $\hat{t}^{(0)}_{0.95}$ and $\hat{t}^{(1)}_{0.95}$ obeying

$$\hat{F}_0(\hat{t}^{(0)}_{0.95}) \geq 0.95 \quad \text{and} \quad \hat{F}_1(\hat{t}^{(1)}_{0.95}) \geq 0.95.$$

Here, in order to cover the target variable for White individuals at least 95% of the time, we should inflate the regression estimate by an additive factor equal to $\hat{t}^{(1)}_{0.95}$. For non-White individuals, the additive factor $\hat{t}^{(0)}_{0.95}$ is smaller. This shows that $\hat{\mu}$ systematically predicts higher utilization for non-White individuals relative to White individuals.
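Reading the per-group 0.05 and 0.95 empirical quantiles off the sorted residuals can be sketched as below (helper name ours): the smallest $t$ with $\hat{F}(t) \geq q$ is the $\lceil qm \rceil$-th smallest of the $m$ residuals.

```python
import numpy as np

def lower_upper_quantiles(residuals, lo=0.05, hi=0.95):
    """Smallest t with ECDF(t) >= lo (resp. >= hi), i.e., the
    ceil(q*m)-th smallest residual for q in {lo, hi}."""
    r = np.sort(np.asarray(residuals))
    m = len(r)
    t_lo = r[int(np.ceil(lo * m)) - 1]
    t_hi = r[int(np.ceil(hi * m)) - 1]
    return t_lo, t_hi

# For residuals 1, ..., 100 this returns the 5th and 95th smallest values.
print(lower_upper_quantiles(np.arange(1, 101)))
```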
We now verify that our proposal constructs intervals with equal coverage across groups. Below, we set $\alpha = 0.1$. To avoid the coverage errors being spread arbitrarily over the left and right tails, we choose to control the two tails independently by setting $\alpha_{lo} = \alpha_{hi} = \alpha/2$ in the asymmetric variant. We arbitrarily set the sizes of the proper training and calibration sets to be equal. (The features are standardized as discussed earlier.)
For our experiments, we test six different methods for producing conformal predictive intervals. We compare two types of constructions for the predictive interval:
Conformal prediction (CP), where the predictive interval is built around an estimated mean $\hat{\mu}(x)$ (as described in Section 2.2);
Conformalized quantile regression (CQR), where the predictive interval is constructed around initial estimates $\hat{q}_{\alpha_{lo}}(x)$ and $\hat{q}_{\alpha_{hi}}(x)$ of the lower and upper quantiles.
In both cases, we use a neural network to construct the models; we train the models using the software provided by Romano et al. (2019), using the same neural network design and learning strategy. For both the CP and CQR constructions, we then implement three versions:
Marginal coverage, where the intervals are constructed by pooling all the data together rather than splitting into subgroups according to the value of $A$;
Conditional coverage with groupwise models, where the initial model for the mean or for the quantiles is constructed separately for each group $A = 0$ and $A = 1$;
Conditional coverage with a joint model, where the initial model for the mean or for the quantiles is constructed by pooling data across both groups $A = 0$ and $A = 1$.
The results are summarized in Table 1, displaying the average length and coverage of the marginal and group-conditional conformal methods. These are evaluated on unseen test data and averaged over 40 train-test splits, where 80% of the samples are used for training (the calibration examples are a subset of the training data) and 20% for testing. All the conditional methods perfectly achieve 90% coverage per group (this is a theorem, after all). On the other hand, the marginal CP method undercovers the White group and overcovers the non-White group (interestingly, though, the marginal CQR method almost attains equalized coverage even though it is not designed to give such a guarantee).
Turning to the statistical efficiency of the conditional conformal methods, we see that conditional CQR outperforms conditional CP in that it constructs shorter and, hence, more informative intervals, especially for the non-White group. The table also shows that the intervals for the White group are wider than those for the non-White group across all four conditional methods, and that joint model fitting is here more effective than groupwise model fitting as the former achieves shorter prediction intervals.
It is possible that the intervals constructed with our procedure have different lengths across groups. For example, our experiments show that, on average, the White group has wider intervals than the non-White group. Although one might argue that the different length distribution is in itself a type of unfairness, we want to caution the reader against assuming that a fair statistical procedure must necessarily produce intervals of equal length.
There are multiple aspects to consider. First, we believe that when there is a difference in performance across the protected groups, one needs to make this evident to the user and to understand the reasons behind it (we discuss below the issue with artificially forcing the two intervals to be of the same length). In some cases this difference might be reduced by improving the predictive algorithm, collecting more data for the population associated with poorer performance, introducing new features with higher predictive power, and so on. For example, in the context of studies that aim to predict disease risk on the basis of genetic features, it has become apparent that existing risk assessment tools suffer bias due to being constructed based on samples coming primarily from European populations; these tools will be much more effective if based on a larger sample that better reflects the diversity in the general population. It may also be the case that higher predictive precision in one group versus another may arise from bias, whether intentional or not, in the type of model we use, the choice of features we measure, or other aspects of our regression process—for example, if historically more emphasis was placed on finding accurate models for a particular group, then we may be measuring features that are highly informative for prediction within that group, while another group would be better served by measuring a different set of variables. Crucially, we do not want to mask this differential in information, but rather make it explicit—thereby possibly motivating the decision makers to take action.
We also note that in some cases, reducing the difference in performance by increasing information might not be possible. For example, the collection of a large enough sample for a minority population might be impossible due to privacy considerations or financial burden. Or the outcome in question might have structurally different variability across the groups. In such cases, equal-length prediction intervals could be constructed only artificially, reducing the precision of the statements one can make for a given group, a choice that should be made with the participation of users and policymakers, rather than by data analysts alone.
The debate around fairness in general, and our proposal in particular, requires the definition of classes of individuals across which we would like an unbiased treatment. In some cases these coincide with protected attributes, discrimination on the basis of which is prohibited by law. As a measure of caution against discrimination, legislation sometimes does not allow the decision maker to know or use the protected attribute in reaching a conclusion. While nondiscrimination is a goal everyone should embrace regardless of whether the law mandates it, we shall consider the opportunity of using protected attributes in data-driven recommendation systems. On the one hand, ignoring protected attributes is certainly not sufficient to guarantee the absence of discrimination (see, e.g., Dwork et al., 2012; Hardt et al., 2016; Dieterich et al., 2016; Corbett-Davies and Goel, 2018; Buolamwini and Gebru, 2018; Gardner et al., 2019; Zafar et al., 2017; Chouldechova, 2017). On the other hand, information on protected attributes might be necessary to guarantee equitable treatment. Our procedure relies on knowledge of protected attributes, so we want to expand on this last point a little. In the absence of knowledge of the causal determinants of an outcome, protected attributes can be an important component of a predictor. To quote from Corbett-Davies and Goel (2018): "in the criminal justice system, for example, women are typically less likely to commit a future violent crime than men with similar criminal histories. As a result, gender-neutral risk scores can systematically overestimate a woman's recidivism risk, and can in turn encourage unnecessarily harsh judicial decisions. Recognizing this problem, some jurisdictions, like Wisconsin, have turned to gender-specific risk assessment tools to ensure that estimates are not biased against women."
For disease risk assessment (Example 4 earlier) or related tasks such as diagnosis and drug prescription, race often provides relevant information and is routinely used. Presumably, once we understand the molecular basis of diseases and drug responses, and once sufficiently accurate measurements on patients are available, race may cease to be useful. Given present circumstances, however, Risch et al. (2002) argue that identical treatment is not equal treatment and that a race-neutral or color-blind approach to biomedical research is neither equitable nor advantageous, and would not lead to a reduction of disparities in disease risk or treatment efficacies between groups. In our context, the use of protected attributes allows a rigorous evaluation of the potentially biased performance for different groups.
Clearly, our current proposal can be adopted only when data on protected attributes has been collected; generalizations of the proposed methodology to situations where the group identifier is unknown are topics for further research.
We add to the tools that support fairness in data-driven recommendation systems by developing a highly operational method that can augment any prediction rule with the best available unbiased uncertainty estimates across groups. This is achieved by constructing prediction intervals that attain valid coverage regardless of the value of the sensitive attribute. The method is supported by rigorous coverage guarantees and demonstrated on real-data examples. Although the focus of this article is on continuous response variables, one can adapt tools from conformal inference (Vovk et al., 2005) to construct prediction sets with equalized coverage for categorical target variables as well.
In this article, we have not discussed other measures of fairness: we believe an appropriate comparison would require much more space, and would benefit from the inclusion of multiple voices. In evaluating the different proposals of the growing literature on algorithmic fairness, it might be useful to keep in mind a distinction between properties that should be required versus properties that are merely desirable. As an analogy, in statistical hypothesis testing, we most commonly require a bound on the false positive rate (Type I error); under this constraint, high power (low Type II error) is then desirable.
One century of statistical reasoning has taught us the importance of quantifying uncertainty and error. No algorithm should ever be deployed without a precise and intelligible description of the errors it makes and the statistical guarantees it offers. As practitioners know too well, it is most often not possible to guarantee that all errors are below a certain threshold. It then becomes crucial to select which statistical guarantee is most relevant for a problem, and fairness requires it to hold across different population groups. So, in the case of parole, we might think that the most crucial error to avoid is that of denying freedom to a deserving individual, and we should then require the probability of this error to be below the desired threshold in each population group. Or, as in the case of this article, we might want to provide the user with a 90% predictive interval for the GPA of a student, and we then need to require that its coverage is as advertised in each population. Equality in other measures of performance that have not been identified as primary (such as the length of the predictive intervals) might then be desirable, but should not be prescribed and automatically pursued without a conscious evaluation of the associated costs.
The knowledgeable reader will recognize that our approach is therefore different from the principle of equalized odds advocated in Hardt et al. (2016), which enforces that the two types of errors one can make in a binary classification problem must both be the same across the groups under study. (The cost is here that the algorithm would then need to change the predictions in at least one group to achieve the desired objective; this may be far from desirable and would not treat individuals equitably.) Returning to the distinction between a prescription and a wishlist, we make equalized coverage prescriptive. This does not mean that the data analyst cannot pay attention to other measures of fairness. For instance, she has the freedom to select predictive algorithms that score high on other metrics, for example, by adding empirical constraints to the construction of prediction sets (or intervals). We hope to report on progress in this direction in a future publication.
E. C. was partially supported by the Office of Naval Research (ONR) under grant N00014-16-1-2712, by the Army Research Office (ARO) under grant W911NF-17-1-0304, by the Math + X award from the Simons Foundation and by a generous gift from TwoSigma. Y. R. was partially supported by the ARO grant and by the same Math + X award. Y. R. thanks the Zuckerman Institute, ISEF Foundation and the Viterbi Fellowship, Technion, for providing additional research support. R. F. B. was partially supported by the National Science Foundation via grant DMS-1654076 and by an Alfred P. Sloan fellowship. C. S. was partially supported by the National Science Foundation via grant DMS-1712800.
Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2019). The limits of distribution-free conditional predictive inference. arXiv preprint arXiv:1903.04684.
Buolamwini, J. and Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Friedler, S. A. and Wilson, C., editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81, pages 77–91. PMLR.
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163.
Corbett-Davies, S. and Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.
Dieterich, W., Mendoza, C., and Brennan, T. (2016). Compas risk scales: Demonstrating accuracy equity and predictive parity. Northpoint Inc.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. Association for Computing Machinery.
Executive Office of the President (2014). Big Data: Seizing Opportunities, Preserving Values. Createspace Independent Pub.
Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. Association for Computing Machinery.
Gardner, J., Brooks, C., Andres, J. M., and Baker, R. S. (2018). MORF: A framework for predictive modeling and replication at scale with privacy-restricted mooc data. In 2018 IEEE International Conference on Big Data (Big Data), pages 3235–3244.
Gardner, J., Brooks, C., and Baker, R. (2019). Evaluating the fairness of predictive student models through slicing analysis. In Proceedings of the 9th International Conference on Learning Analytics and Knowledge, pages 225–234. Association for Computing Machinery.
Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, 29, pages 3315–3323. Curran Associates, Inc.
Hebert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. (2018). Multicalibration: Calibration for the (computationally-identifiable) masses. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, pages 1939–1948. PMLR.
Kaback, M. M., O’Brien, J. S., and Rimoin, D. L. (1977). Tay-Sachs disease: Screening and prevention. Alan R. Liss.
Kessler, M. D., Yerges-Armstrong, L., Taub, M. A., Shetty, A. C., Maloney, K., Jeng, L. J. B., Ruczinski, I., Levin, A. M., Williams, L. K., Beaty, T. H., Mathias, R. A., Barnes, K. C., Consortium on Asthma among African-ancestry Populations in Americas (CAAPA), and O’Connor, T. D. (2016). Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry. Nature Communications, 7(1):12521.
Kim, M. P., Ghorbani, A., and Zou, J. (2019). Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254. Association for Computing Machinery.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kivaranovic, D., Johnson, K. D., and Leeb, H. (2019). Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634.
Kleinberg, J., Mullainathan, S., and Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), volume 67, pages 43:1–43:23. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1):33–50.
Kusner, M. J., Loftus, J., Russell, C., and Silva, R. (2017). Counterfactual fairness. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4066–4076. Curran Associates, Inc.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111.
Lei, J., Robins, J., and Wasserman, L. (2013). Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287.
Lei, J. and Wasserman, L. (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7:983–999.
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive confidence machines for regression. In Elomaa, T., Mannila, H., and Toivonen, H., editors, Machine Learning: European Conference on Machine Learning ECML 2002, pages 345–356. Springer Berlin Heidelberg.
Risch, N., Burchard, E., Ziv, E., and Tang, H. (2002). Categorization of humans in biomedical research: genes, race and disease. Genome Biology, 3(7):comment2007.1.
Romano, Y., Patterson, E., and Candès, E. (2019). Conformalized quantile regression. In Wal- lach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 3543–3553. Curran Associates, Inc.
Rudin, C., Wang, C., and Coker, B. (2020). The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1).
Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371–421.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.
Steinwart, I. and Christmann, A. (2011). Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211–225.
Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264.
Taylor, J. W. (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299–311.
Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under covariate shift. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 2530–2540. Curran Associates, Inc.
Vovk, V. (2012). Conditional validity of inductive conformal predictors. In Hoi, S. C. H. and Buntine, W., editors, Proceedings of the Asian Conference on Machine Learning, volume 25, pages 475–490. PMLR.
Vovk, V., Gammerman, A., and Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, pages 444–453. Morgan Kaufmann Publishers Inc.
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world. Springer.
Vovk, V., Nouretdinov, I., and Gammerman, A. (2009). On-line predictive linear regression. Annals of Statistics, 37(3):1566–1590.
Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180.
Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. (2019). Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42.
Zhou, K. Q. and Portnoy, S. L. (1996). Direct use of regression quantiles to construct confidence sets in linear models. Annals of Statistics, 24(1):287–306.
Zhou, K. Q. and Portnoy, S. L. (1998). Statistical inference on heteroscedastic models based on regression quantiles. Journal of Nonparametric Statistics, 9(3):239–260.
We follow Koenker and Bassett (1978) and cast the estimation of the conditional quantiles of Y given X = x as an optimization problem. Given training examples (X_i, Y_i), i = 1, ..., n, we fit a parametric model using the pinball loss (Koenker and Bassett, 1978; Steinwart and Christmann, 2011), defined by
ρ_α(y, ŷ) = α(y − ŷ) if y − ŷ > 0, and (1 − α)(ŷ − y) otherwise,
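To make the loss concrete, here is a minimal NumPy sketch (the function name pinball_loss and the grid-search demonstration are ours, for illustration only). A well-known property motivates its use: minimizing the pinball loss over a constant predictor recovers an empirical α-quantile of the sample.

```python
import numpy as np

def pinball_loss(y, y_hat, alpha):
    """Pinball (quantile) loss rho_alpha: under-prediction (y > y_hat) is
    penalized by alpha, over-prediction by (1 - alpha)."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff))

# Demonstration: the constant minimizing the pinball loss at level 0.9
# sits at the empirical 0.9-quantile of the sample.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
grid = np.linspace(-3.0, 3.0, 601)
losses = [pinball_loss(y, c, 0.9) for c in grid]
best = grid[int(np.argmin(losses))]  # close to np.quantile(y, 0.9)
```

In the article's setting, the constant is replaced by the output of a neural network evaluated at x, trained by stochastic gradient descent on this same loss.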
where ŷ is the output of a regression function formulated as a deep neural network. The network design and training algorithm are identical to those described in Romano et al. (2019). Specifically, we use a two-hidden-layer neural network with ReLU nonlinearities; the hidden dimension of both layers is set to 64. We use the Adam optimizer (Kingma and Ba, 2014) with a fixed minibatch size and a fixed learning rate. We employ weight decay regularization and also use dropout (Srivastava et al., 2014). We tune the number of epochs using cross-validation (early stopping).
Input: Data (X_i, A_i, Y_i), i = 1, ..., n; nominal coverage level 1 − α; quantile regression algorithm A; training mode: joint/groupwise; test point (x, a) with sensitive attribute a.
Procedure:
1. Randomly split {1, ..., n} into two disjoint sets, I_1 (proper training) and I_2 (calibration).
2. If joint training: fit quantile functions on the whole proper training set: {q̂_lo, q̂_hi} ← A({(X_i, A_i, Y_i) : i ∈ I_1}). Else (groupwise training): fit quantile functions on the proper training examples from group a: {q̂_lo, q̂_hi} ← A({(X_i, Y_i) : i ∈ I_1, A_i = a}).
3. Compute the conformity score E_i, as in (3), for each i ∈ I_2(a) = {i ∈ I_2 : A_i = a}.
4. Compute Q_{1−α}(E, I_2(a)), the (1 − α)(1 + 1/|I_2(a)|)-th empirical quantile of {E_i : i ∈ I_2(a)}.
Output: Prediction interval Ĉ(x, a) = [q̂_lo(x) − Q_{1−α}(E, I_2(a)), q̂_hi(x) + Q_{1−α}(E, I_2(a))] for the unknown response Y.
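The calibration step can be sketched in a few lines of NumPy. This is an illustrative reimplementation under names of our choosing, using the CQR-style conformity score max{q̂_lo(X) − Y, Y − q̂_hi(X)} as a stand-in for the score of equation (3); it is not the authors' released code.

```python
import numpy as np

def equalized_coverage_interval(q_lo, q_hi, X_cal, Y_cal, A_cal, x, a, alpha=0.1):
    """Group-conditional split-conformal calibration (illustrative sketch).

    q_lo, q_hi : callables mapping an array of feature rows to lower/upper
                 quantile estimates (already fitted on the proper training set).
    Returns a prediction interval for the test point (x, a).
    """
    mask = np.asarray(A_cal) == a
    lo, hi = q_lo(X_cal[mask]), q_hi(X_cal[mask])
    # Conformity scores: signed distance of Y outside the quantile band.
    E = np.maximum(lo - Y_cal[mask], Y_cal[mask] - hi)
    n_a = E.shape[0]
    # The (1 - alpha)(1 + 1/n_a)-th empirical quantile of the scores.
    k = min(n_a, int(np.ceil((1.0 - alpha) * (n_a + 1))))
    Q = np.sort(E)[k - 1]
    x_row = np.atleast_2d(x)
    return q_lo(x_row).item() - Q, q_hi(x_row).item() + Q

# Toy usage: constant quantile estimates, deliberately too narrow, for a
# population with two groups that differ in location.
rng = np.random.default_rng(1)
n = 2000
A_cal = rng.integers(0, 2, size=n)
X_cal = np.zeros((n, 1))
Y_cal = rng.normal(size=n) + 2.0 * A_cal
q_lo = lambda X: np.full(X.shape[0], -0.5)
q_hi = lambda X: np.full(X.shape[0], 0.5)
lo0, hi0 = equalized_coverage_interval(q_lo, q_hi, X_cal, Y_cal, A_cal,
                                       np.zeros(1), a=0, alpha=0.1)
```

Because the calibration quantile is computed only from scores belonging to the test point's group, the resulting interval inherits the group-conditional coverage guarantee of split conformal prediction, even when the underlying quantile estimates are poor.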
Table: Length and coverage of both marginal and group-conditional prediction intervals constructed by conformal prediction (CP) and conformalized quantile regression (CQR) for the Medical Expenditure Panel Survey (MEPS) data set; the rows compare Conditional CP (groupwise), Conditional CP (joint), Conditional CQR (groupwise), and Conditional CQR (joint). Note. The results are averaged across 40 random train-test (80%/20%) splits. Groupwise: two independent predictive models are used, one for non-White and another for White individuals; joint: the same predictive model is used for all individuals. In all cases, the model is formulated as a neural network. The methods marked by an asterisk are not supported by a group-conditional coverage guarantee.
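The marginal and group-conditional coverage and length summaries reported in the table can be computed with a short helper along these lines (the function name and interface are illustrative, not from the authors' repository):

```python
import numpy as np

def group_conditional_metrics(y, lower, upper, groups):
    """Empirical coverage and average interval length, both marginally and
    within each group. Returns {key: (coverage, avg_length)}."""
    y, lower, upper, groups = map(np.asarray, (y, lower, upper, groups))
    covered = (y >= lower) & (y <= upper)   # indicator: response inside interval
    length = upper - lower
    metrics = {"marginal": (covered.mean(), length.mean())}
    for g in np.unique(groups):
        m = groups == g
        metrics[g] = (covered[m].mean(), length[m].mean())
    return metrics

# Toy usage on four test points split across two groups:
m = group_conditional_metrics(
    y=[0.0, 2.0, 1.0, 3.0],
    lower=[-1.0, 0.0, 0.0, 0.0],
    upper=[1.0, 1.0, 2.0, 2.0],
    groups=["White", "White", "non-White", "non-White"],
)
```

A method satisfying equalized coverage should show per-group coverage close to the nominal level for every group, even when the group-wise average lengths differ.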
The code for reproducing all the experiments is available online at https://github.com/yromano/cqr.
This article is © 2020 by Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel J. Candès. The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.