An important factor in guaranteeing a fair use of data-driven recommendation systems is the ability to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology, equalized coverage, is flexible, as it can be viewed as a wrapper around any predictive algorithm. We test the applicability of the proposed framework on real data, demonstrating that equalized coverage constructs unbiased prediction intervals, unlike competitive methods.
Keywords: conformal prediction, calibration, uncertainty quantification, fairness, quantile regression, unbiased predictions.
Machine learning algorithms are increasingly deployed in sensitive applications to inform the selection of job candidates, to inform bail and parole decisions, and to filter loan applications, among many others. Such practices have become the subject of intense scrutiny, as society must be concerned about whether these algorithms reinforce discrimination and make the status quo normative. A major area of study has therefore been to propose mathematical definitions of appropriate notions of fairness or algorithmic models of fairness, and to make sure that learned models comply with such prescriptions. Because fairness is rather ill defined, it has been reported that such notions can be incompatible with one another and/or can hurt the very groups they intend to protect.
In this work, we follow the prescription of Corbett-Davies and Goel (2018) and decouple the statistical problem of risk assessment from the policy problem of taking actions or designing interventions. Rather than dictating policy, our aim will solely be the design of statistical algorithms providing the decision maker with information summarizing the knowledge that can be extracted from state-of-the-art machine learning systems in a way that is mathematically guaranteed to be unbiased regardless of a person’s protected attributes, providing an operational definition of fairness. This is accomplished by constructing prediction sets, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology is flexible in the sense that it can be wrapped around any predictive algorithm. For instance, in a stylized application where a recommendation system for college admission predicts the GPA of candidate students after 2 years of undergraduate education, our methodology would modify this system to produce, for each student, a range of values obeying two properties. First, the range contains the true outcome 90% of the time (or any other percentage). Second, this property holds regardless of the group to which the student belongs.
We are increasingly turning to machine learning systems to support human decisions. While decision makers may be subject to many forms of prejudice and bias, the promise and hope is that machines would be able to make more equitable decisions. Unfortunately, whether because they are fitted on already biased data or for other reasons, there are concerns that some of these data-driven recommendation systems treat members of different classes differently, perpetuating biases, providing different degrees of utility, and inducing disparities. The examples that have emerged are quite varied:
Criminal justice: Courts in the United States may use COMPAS—a commercially available algorithm to assess a criminal defendant’s likelihood of becoming a recidivist—to help them decide who should receive parole, based on records collected through the criminal justice system. In 2016 ProPublica analyzed COMPAS and found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk (Dieterich et al., 2016).
Recognition system: A department of motor vehicles (DMV) may use facial recognition tools to detect people with false identities, by comparing driver’s license or ID photos with other DMV images on file. In a related context, Buolamwini and Gebru (2018) evaluated the performance of three commercial classification systems that employ facial images to predict individuals’ gender, and reported that the overall classification accuracy on male individuals was higher than on female individuals. They also found that the predictive performance on lighter skinned individuals was higher than on darker skinned individuals.
College admissions: A college admission office may be interested in a new algorithm for predicting the college GPA of a candidate student at the end of their sophomore year, by using features such as high school GPA, SAT scores, AP courses taken and scores, intended major, levels of physical activity, and so on. On a similar matter, the work reported in Gardner et al. (2019) studied various data-driven algorithms that aim to predict whether a student will drop out from a massive open online course (MOOC). Using a large data set available from Gardner et al. (2018), the authors found that in some cases there are noticeable differences between the models’ predictive performance on male students compared to female students.
Disease risk: Health care providers may be interested in predicting the chance that an individual develops certain disorders. Diseases with a genetic component have different frequencies in different human populations, reflecting the fact that disease-causing mutations arose at different times and in individuals residing in different areas: for example, Tay-Sachs disease is approximately 100 times more common in infants of Ashkenazi Jewish ancestry (Central-Eastern Europe) than in non-Jewish infants (Kaback et al., 1977). The genotyping of DNA polymorphisms can lead to more precise individual risk assessment than that derived from simply knowing to which ethnic group the individual belongs. However, given our still partial knowledge of the disease-causing mutations and their prevalence in different populations, the precision of these estimates varies substantially across ethnic groups. For instance, the study reported in Kessler et al. (2016) found a preference for European genetic variants over non-European variants in two genomic databases that are widely used by clinical geneticists (this reflects the fact that most studies have been conducted on European populations). Relying only on this information would result in predictions that are more accurate for individuals of European descent than for others.
The breadth of these examples underscores how data must be interpreted with care; the method that is advocated in this article is useful regardless of whether the disparity is due to factors of inequality/bias, or instead due to genetic risk. Indeed, policymakers have issued a call (Executive Office of the President, 2014):
we must uphold our fundamental values so these systems are neither destructive nor opportunity limiting. [...] In order to ensure that growth in the use of data analytics is matched with equal innovation to protect the rights of Americans, it will be important to support research into mitigating algorithmic discrimination, building systems that support fairness and accountability, and developing strong data ethics frameworks.
This is a broad call that covers multiple aspects of data collection, mining, and interpretation; clearly, a response requires a multifaceted approach. Encouragingly, the machine learning community is beginning to respond to this challenge. A major area of study has been to propose mathematical definitions of appropriate notions of fairness (Dwork et al., 2012; Hardt et al., 2016; Dieterich et al., 2016; Zafar et al., 2017; Hebert-Johnson et al., 2018; Kim et al., 2019) or algorithmic models of fairness (Kusner et al., 2017). In many cases, these definitions are an attempt to articulate in mathematical terms what it means not to discriminate on the basis of protected characteristics; U.S. law identifies these as sex, race, age, disability, color, creed, national origin, religion, and genetic information. Now, discrimination can take many forms, and it is not surprising that it might be difficult to identify one analytical property that detects it in every context. Moreover, the call above is broader than the specific domains where discrimination is forbidden by law and invites us to develop analytical frameworks that guarantee an ethical use of data.
To begin to formalize the problem, it is useful to consider the task of predicting the value of
Researchers have noted several problems with fairness measures that ask for (approximate) parity of some statistical measure across all of these groups. Without providing a complete discussion, we list some of these problems here. (a) To begin with, it is usually unclear how to design algorithms that would actually obey these notions of fairness from finite samples, especially in situations where the outcome of interest or protected attribute is continuous. (b) Even if we could somehow operationalize the fairness program, these measures are usually incompatible: it is provably impossible to design an algorithm that obeys all notions of fairness simultaneously (Chouldechova, 2017; Kleinberg et al., 2017). (c) The appropriate measure of fairness appears to be context dependent. Consider Example 4 and suppose that
In light of this, it has been suggested to decouple the statistical problem of risk assessment from the policy problem of taking actions and designing interventions. Quoting from Corbett-Davies and Goel (2018), “an algorithm might (correctly) infer that a defendant has a 20% chance of committing a violent crime if released, but that fact does not, in and of itself, determine a course of action.” Keeping away from policy then, how can we respond to the call in Executive Office of the President (2014) and provide a policymaker the best information gleaned from data while supporting equitable treatment? Our belief is that multiple approaches will be needed, and with this short article our aim is to introduce an additional tool to evaluate the performance of algorithms across different population groups.
One fundamental way to support data ethics is not to overstate the power of algorithms and data-based predictions, but rather to always accompany these with measures of uncertainty that are easily understandable by the user. This can be done, for example, by providing a plausible range of predicted values for the outcome of interest. For instance, consider a recommendation system for college admission (Example 3): not knowing the accuracy of the prediction algorithm, we would like to produce, for each student, a predicted GPA interval
Such a predictive interval has the virtue of informing the decision maker about the evidence machine learning can provide while being explicit about the limits of predictive performance. If the interval is long, it just means that the predictive model can say little. Each group enjoys identical treatment, receiving equal coverage (e.g., 90%, or any level the decision maker wishes to achieve). Hence, the results of data analysis are unbiased to all. In particular, if the larger sample size available for one group overly influences the fit, leading to poor performance in the other groups, the prediction interval will make this immediately apparent through much wider confidence bands for the groups with fewer samples. Prediction intervals with equalized coverage, then, naturally assess and communicate the fact that an algorithm has varied levels of performance on different subgroups.
It seems impossible a priori to present information to the policymaker in such a compelling fashion without a strong model for dependence of the response
The specific procedure we suggest to construct predictive intervals with equal coverage, then, supports equitable treatment in an additional dimension. Specifically, we use the same learning algorithm for all individuals, borrowing strength from the entire population, and leveraging the entire data set, while adjusting global predictions to make local confidence statements valid for each group. Such a training strategy may also improve the statistical efficiency of the predictive model, as illustrated by our experiments in Section 3. Of course, our approach comes with limitations as well: we discuss these and possible extensions in Section 4.
Let
for all
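To make the requirement concrete, a plausible formalization of condition (1) — written here under the assumed notation of a response $Y$, features $X$, and a protected group attribute $A$ taking values in a finite set $\mathcal{A}$ — is the group-conditional coverage statement

$$
\mathbb{P}\left\{ Y_{n+1} \in C(X_{n+1}, A_{n+1}) \mid A_{n+1} = a \right\} \;\geq\; 1-\alpha \qquad \text{for all } a \in \mathcal{A},
$$

where $C(\cdot,\cdot)$ is the constructed prediction interval and $1-\alpha$ (e.g., 90%) is the nominal coverage level chosen by the decision maker.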
In this section we present a methodology to achieve (1). Our solution builds on classical conformal prediction (Vovk et al., 2005; Lei et al., 2018) and the recent conformalized quantile regression (CQR) approach (Romano et al., 2019) originally designed to construct marginal distribution-free prediction intervals (see also Kivaranovic et al., 2019). CQR combines the rigorous coverage guarantee of conformal prediction with the statistical efficiency of quantile regression (Koenker and Bassett, 1978) and has been shown to be adaptive to the local variability of the data distribution under study. Below, we present a modification of CQR obeying (1). Then in Section 2.2, we draw connections to conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) and explain how classical conformal inference can also be used to construct prediction intervals with equal coverage across protected groups.
Before describing the proposed method we introduce a key result in conformal prediction, adapted to our conditional setting. Variants of the following lemma appear in the literature (Vovk, 2012; Vovk et al., 2005; Lei et al., 2018; Tibshirani et al., 2019; Romano et al., 2019).
Lemma 1. Suppose the random variables
For any
Moreover, if the random variables
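For reference, a standard statement of this exchangeability result — sketched here in the form it usually takes, which may differ slightly from the authors' Lemma 1 — is: if $V_1, \dots, V_{n+1}$ are exchangeable random variables, then for any $\alpha \in (0,1)$,

$$
\mathbb{P}\left\{ V_{n+1} \leq \hat{Q}_{1-\alpha}(V_1, \dots, V_n) \right\} \;\geq\; 1-\alpha,
$$

where $\hat{Q}_{1-\alpha}(V_1, \dots, V_n)$ denotes the $\lceil (1-\alpha)(n+1) \rceil$-th smallest of $V_1, \dots, V_n$ (set to $+\infty$ when this index exceeds $n$). If, in addition, the $V_i$ are almost surely distinct, the probability above is also at most $1-\alpha + 1/(n+1)$.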
Our method starts by randomly splitting the
at levels
This motivates the next step that borrows ideas from split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) and CQR (Romano et al., 2019). Consider a group
and evaluating
This step provides a family of conformity scores
Finally, the following crucial step builds a prediction interval for the unknown
which is then used to calibrate the first interval estimate as follows:
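A sketch of the likely form of this calibration, following the CQR construction of Romano et al. (2019) with the calibration carried out within each group (the authors' exact notation may differ): given initial lower and upper quantile estimates $\hat{q}_{\alpha_{lo}}$ and $\hat{q}_{\alpha_{hi}}$, each calibration point $(X_i, Y_i)$ in group $a$ receives the conformity score

$$
E_i \;=\; \max\left\{ \hat{q}_{\alpha_{lo}}(X_i) - Y_i, \; Y_i - \hat{q}_{\alpha_{hi}}(X_i) \right\},
$$

and, writing $Q_{1-\alpha}(E; a)$ for the suitably inflated empirical $(1-\alpha)$ quantile of these scores within group $a$, the calibrated interval for a test point $(x, a)$ takes the form

$$
C(x, a) \;=\; \left[\, \hat{q}_{\alpha_{lo}}(x) - Q_{1-\alpha}(E; a), \;\; \hat{q}_{\alpha_{hi}}(x) + Q_{1-\alpha}(E; a) \,\right].
$$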
Before proving the validity of the interval in (4), we pause to present two possible training strategies for the initial quantile regression interval
Theorem 1. If
for each group
for each group
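Concretely, the bounds one expects here — sketched under the assumption that group $a$ contributes $n_a$ calibration samples — are of the form

$$
1-\alpha \;\leq\; \mathbb{P}\left\{ Y \in C(X, A) \mid A = a \right\} \;\leq\; 1-\alpha + \frac{1}{n_a + 1},
$$

with the upper bound requiring the conformity scores to be almost surely distinct.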
Proof. Fix any group
where the upper bound holds under the additional assumption that
To prove the validity of
Hence, the result follows from (5).
Variant: Asymmetric group-conditional CQR. When the distribution of the conformity scores is highly skewed, the coverage error may spread asymmetrically over the left and right tails. In some applications it may be better to consider a variant of the above construction that controls the coverage of the two tails separately, leading to a stronger conditional coverage guarantee. To achieve this goal, we follow the approach from Romano et al. (2019) and evaluate two separate empirical quantile functions: one for the left tail,
and the second for the right tail
Next, we set
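A plausible form of this asymmetric construction (a sketch, not necessarily the authors' exact notation): with one-sided scores $E^{lo}_i = \hat{q}_{\alpha_{lo}}(X_i) - Y_i$ and $E^{hi}_i = Y_i - \hat{q}_{\alpha_{hi}}(X_i)$, the interval becomes

$$
C(x, a) \;=\; \left[\, \hat{q}_{\alpha_{lo}}(x) - Q_{1-\alpha/2}(E^{lo}; a), \;\; \hat{q}_{\alpha_{hi}}(x) + Q_{1-\alpha/2}(E^{hi}; a) \,\right],
$$

so that each tail's miscoverage is controlled at level $\alpha/2$ within each group, and the two guarantees combine, via a union bound, into overall coverage of at least $1-\alpha$.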
The validity of this procedure is stated below.
Theorem 2. Suppose the samples
and
where the lower bounds above always hold while the upper bounds hold under the additional assumption that the residuals are almost surely distinct. Under these circumstances, the interval (5) obeys
Proof. As in the proof of Theorem 1, the validity of the lower and upper bounds is obtained by applying Lemma 1 twice.
The difference between CQR (Romano et al., 2019) and split conformal prediction (Papadopoulos et al., 2002) is that the former calibrates an estimated quantile regression interval
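In symbols, the distinction is between calibrating with absolute residuals around a fitted mean versus signed distances to fitted quantiles; a sketch in generic notation (with $\hat{\mu}$ the estimated regression function):

$$
E_i^{\mathrm{CP}} \;=\; \left| Y_i - \hat{\mu}(X_i) \right|
\qquad \text{versus} \qquad
E_i^{\mathrm{CQR}} \;=\; \max\left\{ \hat{q}_{\alpha_{lo}}(X_i) - Y_i, \; Y_i - \hat{q}_{\alpha_{hi}}(X_i) \right\}.
$$

As a consequence, split conformal intervals have constant length $2\,Q_{1-\alpha}(E^{\mathrm{CP}})$ for every test point, while CQR intervals inherit the local width of the underlying quantile estimates.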
The Medical Expenditure Panel Survey (MEPS) 2016 data set, provided by the Agency for Healthcare Research and Quality, contains information on individuals and their utilization of medical services. The features used for modeling include age, marital status, race, poverty status, functional limitations, health status, health insurance type, and more. We split these features into dummy variables to encode each category separately. The goal is to predict the health care system utilization of each individual; a score that reflects the number of visits to a doctor’s office, hospital visits, and so on. After removing observations with missing entries, there are
Below, we illustrate that empirical quantiles can be used to detect prediction bias. Next, we show that usual (marginal) conformal methods do not attain equal coverage across the two groups. Finally, we compare the performance of joint vs. groupwise model fitting and show that, in this example, the former yields shorter predictive intervals.
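As an illustration of the first point, the following minimal sketch computes per-group empirical quantiles of the signed test residuals; the function name and interface are ours, not the authors' code.

```python
import numpy as np

def groupwise_residual_quantiles(y_true, y_pred, groups, probs=(0.05, 0.95)):
    """Per-group empirical quantiles of the signed residuals y - y_hat."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    groups = np.asarray(groups)
    return {a: tuple(np.quantile(residuals[groups == a], probs))
            for a in np.unique(groups)}
```

Markedly different quantile pairs across groups indicate that the same point prediction would need different amounts of inflation to cover each group at the desired rate — exactly the bias diagnosed below.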
We randomly split the data into training (80%) and test (20%) sets and standardize the features to have zero mean and unit variance; the means and variances are computed using the training examples. Then we fit a neural network regression function
where
Recall that the lower and upper quantiles of the signed residuals are used to construct valid group-conditional prediction intervals. While these must be evaluated on calibration examples (see next section), for illustrative purposes we present below the 0.05th and 0.95th quantiles of each group using the two cumulative distribution functions of test residuals. To this end, we denote by
Following Figure 1, this pair is equal to
implying that for at least 5% of the test samples of each group, the fitted regression function
As for the upper empirical quantiles, we compute the smallest
and obtain
Here, in order to cover the target variable for White individuals at least 95% of the time, we should inflate the regression estimate by an additive factor equal to
We now verify that our proposal constructs intervals with equal coverage across groups. Below, we set
For our experiments, we test six different methods for producing conformal predictive intervals. We compare two types of constructions for the predictive interval:
Conformal prediction (CP), where the predictive interval is built around an estimated mean
Conformalized quantile regression (CQR), where the predictive interval is constructed around initial estimates
In both cases, we use a neural network to construct the models; we train the models using the software provided by Romano et al. (2019), using the same neural network design and learning strategy. For both the CP and CQR constructions, we then implement three versions:
Marginal coverage, where the intervals
Conditional coverage with groupwise models, where the initial model for the mean
Conditional coverage with a joint model, where the initial model for the mean
The results are summarized in Table 1, displaying the average length and coverage of the marginal and group-conditional conformal methods. These are evaluated on unseen test data and averaged over 40 train-test splits, where 80% of the samples are used for training (the calibration examples are a subset of the training data) and 20% for testing. All the conditional methods achieve the target 90% coverage per group, up to small finite-sample fluctuations (this is a theorem, after all). On the other hand, the marginal CP method undercovers the White group and overcovers the non-White group (interestingly, though, the marginal CQR method almost attains equalized coverage even though it is not designed to give such a guarantee).
Turning to the statistical efficiency of the conditional conformal methods, we see that conditional CQR outperforms conditional CP in that it constructs shorter and, hence, more informative intervals, especially for the non-White group. The table also shows that the intervals for the White group are wider than those for the non-White group across all four conditional methods, and that joint model fitting is here more effective than groupwise model fitting as the former achieves shorter prediction intervals.
It is possible that the intervals constructed with our procedure have different lengths across groups. For example, our experiments show that, on average, the White group has wider intervals than the non-White group. Although one might argue that the different length distribution is in itself a type of unfairness, we want to caution the reader against assuming that a fair statistical procedure must necessarily produce intervals of equal length.
There are multiple aspects to consider. First, we believe that when there is a difference in performance across the protected groups, one needs to make this evident to the user and to understand the reasons behind it (we discuss below the issue with artificially forcing the two intervals to be of the same length). In some cases this difference might be reduced by improving the predictive algorithm, collecting more data for the population associated with poorer performance, introducing new features with higher predictive power, and so on. For example, in the context of studies that aim to predict disease risk on the basis of genetic features, it has become apparent that existing risk assessment tools suffer bias due to being constructed based on samples coming primarily from European populations; these tools will be much more effective if based on a larger sample that better reflects the diversity in the general population. It may also be the case that higher predictive precision in one group versus another may arise from bias, whether intentional or not, in the type of model we use, the choice of features we measure, or other aspects of our regression process—for example, if historically more emphasis was placed on finding accurate models for a particular group
We also note that in some cases, reducing the difference in performance might not be possible while increasing information. For example, the collection of a large enough sample for a minority population might be impossible due to privacy considerations and financial burden. Or the outcome in question might have structurally different variability across the groups. In such cases, equal length prediction intervals might be constructed only artificially, reducing the precision of the statements one can make for a given group—a choice that should be made with the participation of users and policymakers, rather than by data analysts alone.
The debate around fairness in general, and our proposal in particular, requires the definition of classes of individuals across which we would like an unbiased treatment. In some cases these coincide with protected attributes on whose basis discrimination is prohibited by law. The legislation sometimes does not allow the decision maker to know/use the protected attribute in reaching a conclusion, as a measure to caution against discrimination. While nondiscrimination is a goal everyone should embrace regardless of whether the law mandates it or not, we shall consider the opportunity of using protected attributes in data-driven recommendation systems. On the one hand, ignoring protected attributes is certainly not sufficient to guarantee absence of discrimination (see, e.g., Dwork et al., 2012; Hardt et al., 2016; Dieterich et al., 2016; Corbett-Davies and Goel, 2018; Buolamwini and Gebru, 2018; Gardner et al., 2019; Zafar et al., 2017; Chouldechova, 2017). On the other hand, information on protected attributes might be necessary to guarantee equitable treatment. Our procedure relies on the knowledge of protected attributes, so we want to expand on this last point a little. In the absence of knowledge of the causal determinants of an outcome, protected attributes can be an important component of a predictor. To quote from Corbett-Davies and Goel (2018): “in the criminal justice system, for example, women are typically less likely to commit a future violent crime than men with similar criminal histories. As a result, gender-neutral risk scores can systematically overestimate a woman’s recidivism risk, and can in turn encourage unnecessarily harsh judicial decisions. Recognizing this problem, some jurisdictions, like Wisconsin, have turned to gender-specific risk assessment tools to ensure that estimates are not biased against women.” For disease risk assessment (Example 4 earlier) or related tasks such as diagnosis and drug prescription, race often provides relevant information and is routinely used. Presumably, once we understand the molecular basis of diseases and drug responses, and once sufficiently accurate measurements on patients are available, race may cease to be useful. Given present circumstances, however, Risch et al. (2002) argue that identical treatment is not equal treatment and that a race-neutral or color-blind approach to biomedical research is neither equitable nor advantageous, and would not lead to a reduction of disparities in disease risk or treatment efficacies between groups. In our context, the use of protected attributes allows a rigorous evaluation of the potentially biased performance for different groups.
Clearly, our current proposal can be adopted only when data on protected attributes has been collected; generalizations of the proposed methodology to situations where the group identifier is unknown are topics for further research.
We add to the tools that support fairness in data-driven recommendation systems by developing a highly operational method that can augment any prediction rule with the best available unbiased uncertainty estimates across groups. This is achieved by constructing prediction intervals that attain valid coverage regardless of the value of the sensitive attribute. The method is supported by rigorous coverage guarantees, as demonstrated on real-data examples. Although the focus of this article is on continuous response variables, one can adapt tools from conformal inference (Vovk et al., 2005) to construct prediction sets with equalized coverage for categorical target variables as well.
In this article, we have not discussed other measures of fairness: we believe an appropriate comparison would require much larger space, and would benefit from the inclusion of multiple voices. In evaluating the different proposals of the growing literature on algorithmic fairness, it might be useful to keep in mind a distinction between properties that should be required versus properties that are merely desirable. As an analogy, in statistical hypothesis testing, most commonly we require a bound on the false positive rate (Type I error); under this constraint, high power (low Type II error) is then desirable.
One century of statistical reasoning has taught us the importance of quantifying uncertainty and error. No algorithm should ever be deployed without a precise and intelligible description of the errors it makes and the statistical guarantees it offers. As practitioners know too well, it is most often not possible to guarantee that all errors are below a certain threshold. It becomes then crucial to select which statistical guarantee is most relevant for a problem, and fairness requires it to hold across different population groups. So, in the case of parole, we might think that the most crucial error to avoid is that of denying freedom to a deserving individual, and we should then enforce that the probability of this error be below the desired threshold in each population group. Or, as in the case of this article, we might want to provide the user with a 90% predictive interval for the GPA of a student, and we then need to require that its coverage is as advertised in each population. Equality in other measures of performance that have not been identified as primary (such as the length of the predictive intervals) might then be desirable, but should not be prescribed and automatically pursued without a conscious evaluation of the associated costs.
The knowledgeable reader will recognize that our approach is therefore different from the principle of equalized odds advocated in Hardt et al. (2016), which enforces that the two types of errors one can make in a binary classification problem must both be the same across the groups under study. (The cost here is that the algorithm would then need to change the predictions in at least one group to achieve the desired objective; this may be far from desirable and would not treat individuals equitably.) Returning to the distinction between a prescription and a wishlist, we make equalized coverage prescriptive. This does not mean that the data analyst cannot pay attention to other measures of fairness. For instance, she has the freedom to select predictive algorithms that score high on other metrics, for example, by adding empirical constraints to the construction of prediction sets (or intervals). We hope to report on progress in this direction in a future publication.
E. C. was partially supported by the Office of Naval Research (ONR) under grant N00014-16-1-2712, by the Army Research Office (ARO) under grant W911NF-17-1-0304, by the Math + X award from the Simons Foundation and by a generous gift from TwoSigma. Y. R. was partially supported by the ARO grant and by the same Math + X award. Y. R. thanks the Zuckerman Institute, ISEF Foundation and the Viterbi Fellowship, Technion, for providing additional research support. R. F. B. was partially supported by the National Science Foundation via grant DMS-1654076 and by an Alfred P. Sloan fellowship. C. S. was partially supported by the National Science Foundation via grant DMS-1712800.
We follow Koenker and Bassett (1978) and cast the estimation problem of the conditional quantiles of
where
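For completeness, the pinball (quantile) loss at level $\alpha$, in its standard form consistent with Koenker and Bassett (1978), is

$$
\rho_\alpha(y, \hat{y}) \;=\;
\begin{cases}
\alpha \,(y - \hat{y}), & \text{if } y - \hat{y} \geq 0, \\
(1-\alpha)\,(\hat{y} - y), & \text{otherwise},
\end{cases}
$$

and the estimated conditional $\alpha$ quantile is obtained by minimizing the average of this loss over the training data.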
Input:
Data
Process:
Randomly split
Output:
Prediction interval
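To make the workflow concrete, here is a minimal Python sketch of group-conditional split CQR with the joint training strategy. It assumes a hypothetical helper `fit_quantile_model` returning an object whose `predict` gives the two quantile estimates as columns; names and interfaces are illustrative, not the authors' implementation (their code is linked below).

```python
import numpy as np

def equalized_coverage_cqr(X_train, Y_train, A_train, X_test, A_test,
                           fit_quantile_model, alpha=0.1, seed=0):
    """Sketch: intervals aiming at >= 1 - alpha coverage within each group."""
    rng = np.random.default_rng(seed)
    n = len(Y_train)
    perm = rng.permutation(n)
    proper, calib = perm[: n // 2], perm[n // 2:]

    # 1. Fit lower/upper conditional quantiles on the proper training subset
    #    (joint strategy: one model for all groups).
    model = fit_quantile_model(X_train[proper], Y_train[proper],
                               quantiles=(alpha / 2, 1 - alpha / 2))

    lower = np.full(len(A_test), np.nan)
    upper = np.full(len(A_test), np.nan)

    # 2. Calibrate separately within each protected group.
    for a in np.unique(A_train):
        cal = calib[A_train[calib] == a]
        q_lo, q_hi = model.predict(X_train[cal]).T
        scores = np.maximum(q_lo - Y_train[cal], Y_train[cal] - q_hi)

        # Finite-sample corrected empirical quantile of the scores.
        k = int(np.ceil((1 - alpha) * (len(cal) + 1)))
        corr = np.inf if k > len(cal) else np.sort(scores)[k - 1]

        # 3. Adjust the initial intervals for the test points of this group.
        mask = (A_test == a)
        t_lo, t_hi = model.predict(X_test[mask]).T
        lower[mask] = t_lo - corr
        upper[mask] = t_hi + corr

    return lower, upper
```

A groupwise training strategy would instead fit a separate quantile model inside the loop, using only that group's proper training samples.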
| Method | Group | Average Coverage | Average Length |
|---|---|---|---|
| *Marginal CP | Non-White | 0.920 | 2.907 |
| *Marginal CP | White | 0.871 | 2.907 |
| Conditional CP (groupwise) | Non-White | 0.903 | 2.764 |
| Conditional CP (groupwise) | White | 0.901 | 3.182 |
| Conditional CP (joint) | Non-White | 0.904 | 2.738 |
| Conditional CP (joint) | White | 0.902 | 3.150 |
| *Marginal CQR | Non-White | 0.905 | 2.530 |
| *Marginal CQR | White | 0.894 | 3.081 |
| Conditional CQR (groupwise) | Non-White | 0.904 | 2.567 |
| Conditional CQR (groupwise) | White | 0.900 | 3.203 |
| Conditional CQR (joint) | Non-White | 0.902 | 2.527 |
| Conditional CQR (joint) | White | 0.901 | 3.102 |
Note: The results are averaged across 40 random train-test (80%/20%) splits. Groupwise—two independent predictive models are used, one for non-White and another for White individuals; joint—the same predictive model is used for all individuals. In all cases, the model is formulated as a neural network. The methods marked by an asterisk are not supported by a group-conditional coverage guarantee.
The code for reproducing all the experiments is available online at https://github.com/yromano/cqr
Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2019). The limits of distribution-free conditional predictive inference. arXiv. https://doi.org/10.48550/arXiv.1903.04684
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In S. A. Friedler & C. Wilson (Eds.), Proceedings of Machine Learning Research: Vol. 81. Proceedings of the 1st Conference on Fairness, Accountability and Transparency (pp. 77–91). http://proceedings.mlr.press/v81/buolamwini18a.html
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153–163. https://doi.org/10.1089/big.2016.0047
Corbett-Davies, S. & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv. https://doi.org/10.48550/arXiv.1808.00023
Dieterich, W., Mendoza, C., & Brennan, T. (2016). COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (pp. 214–226). Association for Computing Machinery. https://doi.org/10.1145/2090236.2090255
Executive Office of the President (2014). Big Data: Seizing Opportunities, Preserving Values. Createspace Independent Pub.
Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., & Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 259–268). Association for Computing Machinery. https://doi.org/10.1145/2783258.2783311
Gardner, J., Brooks, C., Andres, J. M., & Baker, R. S. (2018). MORF: A framework for predictive modeling and replication at scale with privacy-restricted MOOC data. In 2018 IEEE International Conference on Big Data (pp. 3235–3244). http://doi.org/10.1109/BigData.2018.8621874
Gardner, J., Brooks, C., & Baker, R. (2019). Evaluating the fairness of predictive student models through slicing analysis. In Proceedings of the 9th International Conference on Learning Analytics and Knowledge (pp. 225–234). Association for Computing Machinery. https://doi.org/10.1145/3303772.3303791
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol. 29 (pp. 3315–3323). Curran Associates, Inc.
Hebert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018). Multicalibration: Calibration for the (computationally-identifiable) masses. In J. Dy & A. Krause (Eds.), Proceedings of Machine Learning Research: Vol. 80. Proceedings of the 35th International Conference on Machine Learning (pp. 1939-1948). https://proceedings.mlr.press/v80/hebert-johnson18a.html
Kaback, M. M., O’Brien, J. S., & Rimoin, D. L. (1977). Tay-Sachs disease: Screening and prevention. Alan R. Liss.
Kessler, M. D., Yerges-Armstrong, L., Taub, M. A., Shetty, A. C., Maloney, K., Jeng, L. J. B., Ruczinski, I., Levin, A. M., Williams, L. K., Beaty, T. H., Mathias, R. A., Barnes, K. C., Consortium on Asthma among African-Ancestry Populations in Americas (CAAPA), & O’Connor, T. D. (2016). Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry. Nature Communications, 7(1), Article 12521. https://doi.org/10.1038/ncomms12521
Kim, M. P., Ghorbani, A., & Zou, J. (2019). Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 247–254). Association for Computing Machinery. https://doi.org/10.1145/3306618.3314287
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv. https://doi.org/10.48550/arXiv.1412.6980
Kivaranovic, D., Johnson, K. D., & Leeb, H. (2019). Adaptive, distribution-free prediction intervals for deep neural networks. arXiv. https://doi.org/10.48550/arXiv.1905.10634
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In Leibniz International Proceedings in Informatics: Vol. 67. Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017) (pp. 43:1–43:23). https://doi.org/10.4230/LIPIcs.ITCS.2017.43
Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33–50. https://doi.org/10.2307/1913643
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol 30 (pp. 4066–4076). Curran Associates, Inc.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111. http://doi.org/10.1080/01621459.2017.1307116
Lei, J., Robins, J., & Wasserman, L. (2013). Distribution-free prediction sets. Journal of the American Statistical Association, 108(501), 278–287. https://doi.org/10.1080/01621459.2012.751873
Lei, J., & Wasserman, L. (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B, 76(1), 71–96. http://doi.org/10.1111/rssb.12021
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.
Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). Inductive confidence machines for regression. In T. Elomaa, H. Mannila, & H. Toivonen (Eds.), Lecture notes in computer science: Vol. 2430. Machine Learning: ECML 2002 (pp. 345–356). Springer. https://doi.org/10.1007/3-540-36755-1_29
Risch, N., Burchard, E., Ziv, E., & Tang, H. (2002). Categorization of humans in biomedical research: Genes, race and disease. Genome Biology, 3(7), comment2007. https://doi.org/10.1186/gb-2002-3-7-comment2007
Romano, Y., Patterson, E., & Candès, E. (2019). Conformalized quantile regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol 32 (pp. 3543–3553). Curran Associates, Inc.
Rudin, C., Wang, C., & Coker, B. (2020). The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.6ed64b30
Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371–421.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56), 1929–1958.
Steinwart, I., & Christmann, A. (2011). Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1), 211–225. https://doi.org/10.3150/10-BEJ267
Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 1231–1264.
Taylor, J. W. (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4), 299–311. https://doi.org/10.1002/1099-131X(200007)19:4<299::AID-FOR775>3.0.CO;2-V
Tibshirani, R. J., Foygel Barber, R., Candes, E., & Ramdas, A. (2019). Conformal prediction under covariate shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems: Vol 32 (pp. 2530–2540). Curran Associates, Inc. http://papers.nips.cc/paper/8522-conformal-prediction-under-covariate-shift.pdf
Vovk, V. (2012). Conditional validity of inductive conformal predictors. In S. C. H. Hoi, & W. Buntine (Eds.), Proceedings of Machine Learning Research: Vol. 25. Proceedings of the Asian Conference on Machine Learning (pp. 475–490). https://doi.org/10.1007/s10994-013-5355-6
Vovk, V., Gammerman, A., & Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the 16th International Conference on Machine Learning (pp. 444–453). Morgan Kaufmann Publishers Inc.
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. Springer. http://doi.org/10.1007/b106715
Vovk, V., Nouretdinov, I., & Gammerman, A. (2009). On-line predictive linear regression. Annals of Statistics, 37(3), 1566–1590. https://doi.org/10.1214/08-AOS622
Zafar, M. B., Valera, I., Gomez Rodriguez, M., & Gummadi, K. P. (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web (pp. 1171–1180). https://doi.org/10.1145/3038912.3052660
Zafar, M. B., Valera, I., Gomez-Rodriguez, M., & Gummadi, K. P. (2019). Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(1), 2737–2778.
Zhou, K. Q., & Portnoy, S. L. (1996). Direct use of regression quantiles to construct confidence sets in linear models. Annals of Statistics, 24(1), 287–306.
Zhou, K. Q., & Portnoy, S. L. (1998). Statistical inference on heteroscedastic models based on regression quantiles. Journal of Nonparametric Statistics, 9(3), 239–260. https://doi.org/10.1080/10485259808832745
©2020 Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel J. Candès. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.