This piece is a commentary on the article, “The Age of Secrecy and Unfairness in Recidivism Prediction.”
I first wish to congratulate Rudin, Wang, and Coker (2020, hereafter RWC) for a thorough analysis of COMPAS with the Broward County data. This is a much-needed analysis: the ProPublica analysis (Angwin, Larson, Mattu, & Kirchner, 2016) has led to misconceptions about risk and needs assessments, whereas RWC bring into sharp focus the fundamental problem, a lack of transparency.
RWC’s article shows that advances in science come with transparency, the careful use of statistics as a part of data science, and a clear understanding of the application area (the justice system in this case).
The justice system is not transparent. Arrest decisions, bail assignments, pretrial detention decisions, plea bargain offers, and the sizes of fines and lengths of sentences all occur in an opaque process involving the police, prosecutors, magistrates, judges, and, of course, the defendants. Some of these decisions occur in court hearings with publicly accessible, complete transcripts, while others, such as plea negotiations, can occur by private text messages between a prosecutor and defense counsel. Each of the individuals working in the justice system may have clear principles in mind, but disparities and variability in justice system outcomes lead to the perception that the system is random or even unfair. Tools such as risk assessments and sentencing guidelines are intended to provide some standardization of justice system decisions.
The Pennsylvania sentencing guidelines suggest a prison term of at least 4 years for forcible rape and a prison term of at least 5 years for trafficking more than a kilogram of cocaine. We can argue whether the lengths of these terms are sensible, whether cocaine trafficking should have a more severe sentence than rape, and what aggravating circumstances should result in sentences above the minimums. Importantly, the sentencing guidelines are transparent. We all can see the relationship between the criminal offense and the recommended sentence.
Research has shown that sentencing guidelines change judges’ behaviors. Bushway, Owens, and Piehl (2012) found that 10% of the worksheets used to extract sentencing recommendations for specific cases in Maryland had errors. These errors caused the calculated sentencing guidelines for prison sentences to range from one year too short to two years too long. Judges, unaware of these errors, imposed sentences that were shorter or longer depending on the direction of the errors. The study confirms that data quality matters when producing decision-support tools, but it also shows that such tools are powerful influencers of judges. When public and political awareness of the sentencing disparity between crack and powder cocaine grew, legislation in Congress and executive action by the U.S. Sentencing Commission modified the sentencing guidelines in 2007, 2010, and 2018 and applied them retroactively in an effort to undo past disparities. Transparent tools such as sentencing guidelines and risk assessments can be used to encode our values, implement our evolving views on the criminalization of drug abuse, and push justice system reform.
As with the crack/powder cocaine disparity in sentencing guidelines now being undone, concerns abound that automated risk assessment tools are hardcoding historic (or even persisting) injustices into the justice system. COMPAS is an obvious target since its use is widespread, it generates profits for its owners, and, as RWC point out, it is not transparent. However, using the current justice system as a benchmark, COMPAS does not make our justice system less transparent. Instead, COMPAS is actually a small step forward from the murky workings of the justice system. At least with COMPAS we know its complete set of inputs and can have this discussion about their relationship with its outputs.
As with the justice system, the lack of transparency makes COMPAS easy to misunderstand, and it made it easy for the journalists at ProPublica to assume that differences in error rates must mean that “it’s biased against blacks” (Angwin, Larson, Mattu, & Kirchner, 2016). It truly might be biased, and COMPAS would be more easily diagnosed if it were transparent. But ProPublica’s findings could be yet another case of Simpson’s paradox. RWC show convincingly that appropriately accounting for age leads to the conclusion that COMPAS “has at most weak dependence on race.” A simple numerical example can also show that observed disparities could be explained by confounding.
Table 1 shows ProPublica’s COMPAS error rates for White and Black defendants in Broward County. It shows that about 24% of ‘high-risk’ White defendants did not reoffend while 45% of ‘high-risk’ Black defendants did not reoffend. A similar disparity, in the opposite direction, holds for defendants whom COMPAS rated ‘low risk.’
Table 1. COMPAS Prediction Fails Differently for Black Defendants

| | White | Black |
|---|---|---|
| Labeled Higher Risk, But Didn’t Reoffend | 23.5% | 44.9% |
| Labeled Lower Risk, Yet Did Reoffend | 47.7% | 28.0% |
In Table 2 I have imagined four crime categories that have racial imbalances. Black defendants are more likely to be arrested and charged for crime categories 1 and 3, while White defendants are more likely to be charged for crimes 2 and 4. The risk of reoffending varies depending on the crime category, but, importantly, it does not vary by race. Regardless of race, defendants facing charges in crime categories 1 and 2 are considered low risk (less than 50% chance of reoffending) and those facing charges in crime categories 3 and 4 are considered high risk.
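Table 2. Simulated Example With Race-Neutral Risk Prediction That Reproduces ProPublica’s Error Rates

| | Crime category 1 | Crime category 2 | Crime category 3 | Crime category 4 |
|---|---|---|---|---|
| Number of White defendants | 103 | 2,021 | 855 | 2,021 |
| Number of Black defendants | 2,347 | 224 | 2,347 | 83 |
| Probability of reoffending, independent of race | 0.26 | 0.49 | 0.54 | 0.86 |
| Risk prediction (High when risk is greater than 50%) | Low | Low | High | High |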
I have designed these reoffense probabilities to reproduce ProPublica’s error rates. To see this, calculate, for White and Black defendants separately, how many of them will be classified as high or low risk and how many will go on to reoffend. Table 3 shows the expected counts.
Table 3. Expected Counts of Cases Classified by Race, Risk Prediction, and Future Reoffense

| Race | Outcome | High risk predicted | Low risk predicted |
|---|---|---|---|
| White | Reoffended | 855×0.54 + 2,021×0.86 | 103×0.26 + 2,021×0.49 |
| White | Did not reoffend | 855×0.46 + 2,021×0.14 | 103×0.74 + 2,021×0.51 |
| Black | Reoffended | 2,347×0.54 + 83×0.86 | 2,347×0.26 + 224×0.49 |
| Black | Did not reoffend | 2,347×0.46 + 83×0.14 | 2,347×0.74 + 224×0.51 |
Finally, Table 4 shows the expected error rates from this simulated example. Each error rate is nearly identical to the error rate ProPublica produced, but in this simulated example we know that the risk prediction is race neutral. It depends only on the crime category and the true risk of reoffense based on that crime category. This simple example involved only four crime categories when in reality there are many more. Also, the counts and rates shown in Table 2 are not a unique set of numbers that reproduce ProPublica’s error rates. This shows that racial disparities in error rates in the aggregate may be the result of confounding.
Table 4. Error Rates Observed From a Race-Neutral Risk Assessment

| | White | Black |
|---|---|---|
| Labeled Higher Risk, But Didn’t Reoffend | 23.5% | 44.9% |
| Labeled Lower Risk, Yet Did Reoffend | 47.9% | 28.0% |
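The arithmetic behind the simulated example can be checked with a short script. The following is a minimal Python sketch, not code from the commentary: the defendant counts and reoffense probabilities are read off the expressions in Table 3, and the names `P_REOFFEND`, `COUNTS`, and `error_rates` are hypothetical. Following the description in the text, each rate is computed within a predicted risk label (for example, the share of defendants labeled high risk who are expected not to reoffend).

```python
# Reoffense probability for crime categories 1-4, identical for both races
# (values read off the expressions in Table 3).
P_REOFFEND = [0.26, 0.49, 0.54, 0.86]

# Number of defendants per crime category, also read off Table 3.
COUNTS = {
    "White": [103, 2021, 855, 2021],
    "Black": [2347, 224, 2347, 83],
}


def error_rates(counts, probs, threshold=0.5):
    """Return two expected shares for one group of defendants:
    (labeled high risk but did not reoffend, labeled low risk but reoffended).
    A category is labeled high risk when its probability exceeds `threshold`.
    """
    high_n = high_no = low_n = low_yes = 0.0
    for n, p in zip(counts, probs):
        if p > threshold:
            high_n += n               # defendants labeled high risk
            high_no += n * (1 - p)    # ...expected not to reoffend
        else:
            low_n += n                # defendants labeled low risk
            low_yes += n * p          # ...expected to reoffend
    return high_no / high_n, low_yes / low_n


# The same race-blind rule is applied to both groups; only the mix of
# crime categories differs between them.
rates = {race: error_rates(n, P_REOFFEND) for race, n in COUNTS.items()}
for race, (high_no, low_yes) in rates.items():
    print(f"{race}: high risk, did not reoffend {high_no:.1%}; "
          f"low risk, reoffended {low_yes:.1%}")
```

With these inputs the sketch gives roughly 23.5% and 47.9% for White defendants and 44.9% and 28.0% for Black defendants, even though the labeling rule never looks at race. The gap comes entirely from the different distribution of the two groups across crime categories.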
To their credit, the data scientists at ProPublica made all of their data and analysis code available… completely transparent. This has allowed several teams in addition to RWC to probe their findings. Transparency has made it possible for the science of risk assessment to advance.
I take three key lessons from RWC’s article.
First, transparency reduces the risk of misunderstanding. Unlike most statistical problems in which the unknown parameter can never really be fully known, COMPAS is knowable if only equivant (formerly Northpointe) would reveal its structure. Only then can we be as sure about what we are arguing over as we are with sentencing guidelines. ProPublica’s linearity assumption likely led to an incorrect conclusion about racial bias, something completely avoidable if the COMPAS structure were known.
Second, RWC’s analysis shows that statistics is an integral part of data science. As RWC unravel the ProPublica analysis in their article, it becomes clear that understanding COMPAS in Broward County required a world-class team of data scientists and statisticians in order to get closer to the truth. Data manipulation tools and a few basic modeling techniques resulted in ProPublica data scientists stumbling into Simpson’s paradox without recognizing the risk.
Lastly, data science and statistics cannot ignore the need to have a clear understanding of the application, the justice system in this case. In some moments, RWC misunderstand part of COMPAS’s purpose. COMPAS is a risk and needs assessment. Needs assessments are designed to help align treatment services with an offender’s key issues that likely lead to criminal behavior. RWC suggest that questions like “How hard is it for you to find a job ABOVE minimum wage compared to others?” should not be permitted. Such a question would not be appropriate for assessing criminal risk, but it is an excellent question for understanding whether a probation program with a strong job training component would be advisable. COMPAS also assesses history of victimization and abuse, financial instability, access to shelter and food assistance, substance abuse, and other needs. These are essential for developing a comprehensive plan for offenders that goes beyond sentencing decisions, and they should not be lost in the narrower discussion of risk assessment.
If we do not push forward with the kinds of thorough analyses demonstrated in this article and continue to advance risk prediction tools that are transparent and encode our justice system values, then, in the words of the authors, “we could be stuck in the situation where a human decision maker provides nontransparent, potentially biased decisions.”
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016, May 23). Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. Retrieved February 21, 2020, from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Bushway, S., Owens, E., & Piehl, A. (2012). Sentencing guidelines and judicial discretion: Quasi-Experimental evidence from human calculation errors. Journal of Empirical Legal Studies, 9, 291–319. https://doi.org/10.1111/j.1740-1461.2012.01254.x
Rudin, C., Wang, C., & Coker, B. (2020). The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1).
This article is © 2020 by Greg Ridgeway. The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the author identified above.