This piece is a commentary on the article “The Age of Secrecy and Unfairness in Recidivism Prediction.”
The Justice in Forensic Algorithms Act, introduced in Congress in fall 2019, responds to concerns regarding the lack of transparency in the use of such algorithms with a law that would require that source code be made available to criminal defendants in all cases in which such algorithms are used (Justice in Forensic Algorithms Act of 2019). The act is intended to protect criminal defendants’ due process rights by prohibiting the use of trade secrets privileges to prevent access to, and challenges to, such evidence. The act goes further, directing the National Institute of Standards and Technology (NIST) to establish two new programs: a Computational Forensic Algorithms Standards program and a Computational Forensic Algorithms Testing program that would assess any forensic algorithms before approving them for use by federal law enforcement. This proposed legislation sets out one possible framework for regulating the use of algorithms in criminal justice. Unfortunately, not only may such measures be unlikely to be enacted, but even if they were, far more would be needed to safeguard against the use of untested technology in criminal cases.
I write to comment on “The Age of Secrecy and Unfairness in Recidivism Prediction,” a wonderful new piece by Cynthia Rudin, Caroline Wang, and Beau Coker (2020). The authors argue that secret algorithms should not be permitted to make important decisions, such as calculations of criminal sentences, and further, they partially reconstruct one such secret algorithm to show what should and should not trouble us about its use in practice. Those empirical contributions add important insights to debates about the use of COMPAS, an acronym for Correctional Offender Management Profiling for Alternative Sanctions, a case management tool developed by a company now called equivant (previously Northpointe). To be sure, many other commonly used risk assessment instruments are neither proprietary nor secret, and they are not particularly complex. Thus, the findings regarding one particular (and much-criticized) instrument, based on data from one county (Broward County, Florida), do not necessarily extend to other uses of risk assessment.
Unfortunately, though, the concerns the authors raise regarding COMPAS extend still more broadly to a range of algorithms used in other criminal justice settings. Algorithms have been put to uses ranging from facial recognition, to DNA mixture analysis, to biometric database searches. Nonalgorithmic forensic techniques that rely on black-box or unvalidated methods also continue to be used. When one looks more broadly at the uses of unsound science in criminal cases, one can see that a regulatory scheme, perhaps along the lines of the federal legislation introduced, is urgently needed.
Rudin, Wang, and Coker’s article focuses on COMPAS, and they make surprising and important empirical findings concerning its design. COMPAS is used in Broward County, Florida, in pretrial decisions on whether to release a person facing criminal charges pending trial or to impose some amount of cash bond. A judge is not required to follow the recommendation made by the risk instrument; the risk instrument informs the judge’s discretion. The owner of COMPAS has described what factors are used in the risk assessment, but not how those factors are weighted or how the resulting scales are constructed (Dieterich, Mendoza, & Brennan, 2016). Rudin, Wang, and Coker’s findings run contrary to claims made by the creators of this instrument, who have not made the source code or underlying data available to substantiate claims about its design or functioning. The findings also run contrary to a ProPublica analysis, which found “significant racial disparities” in COMPAS scores, based on analysis of data from 7,000 cases in Broward County, Florida, in 2013–2014 (Angwin, Larson, Mattu, & Kirchner, 2016). Thus, the authors both defend COMPAS from one charge, regarding significant racial disparities, and, at the same time, in analyzing Broward County data, shed light on deeper flaws in COMPAS.
In our federal law, the concern is not with the term ‘fairness’ but with the potential for unconstitutional and/or illegal discrimination (Starr, 2014). As the authors suggest, COMPAS does not “necessarily” depend on race, and it likely does not actually use race as a factor. Doing so might raise serious constitutional concerns. However, if an instrument depends on a factor, such as criminal history, that can serve as a proxy for race, there may be real reasons for concern about the disparate racial effects of its use. The authors note that criminal history can be a proxy for race. They find that the general and violence-predicting COMPAS scores do not rely heavily on criminal history. While these findings suggest the COMPAS system may correlate only weakly with race, the reason why, the authors explain, is the nonlinear role that age plays. The authors explain that “in the Broward County data, African American people tend to be disproportionately represented at younger ages than Caucasians.” This may reflect a pattern of racially biased policing in Broward County, Florida. We do not know more, in part because COMPAS is a black box, but the finding does raise the concern that the predictive power of a risk instrument can be distorted by police practices. The risk assessment instrument may not magnify that preexisting bias, but it may obscure its source.
The authors also explain why COMPAS appears to be both deceptively complex and poorly designed. They describe how COMPAS may rely on age at the time of arrest in a way that is not linear, despite what its designers claim. How COMPAS uses age is an important question; COMPAS appears to depend heavily upon age in its risk recommendations. Megan Stevenson and Christopher Slobogin, in another recent analysis of COMPAS, describe how roughly 60% of the risk scores it produces are attributable to one factor: age (Stevenson & Slobogin, 2019). Similarly, Rudin, Wang, and Coker describe how criminal history is only weakly predictive of COMPAS scores. If COMPAS largely relies on age, then perhaps it is not relying on factors that proxy race. Yet it relies on 137 factors, which require a great deal of data entry. It is possible that one of those factors (age) is doing much of the work, with much of the rest of the data collected and retained by a private corporation without contributing meaningfully to the predictive task. The authors ask, “If COMPAS does not depend heavily on most of the 137 variables, including the proxies for socioeconomic status,” then why collect such “private information”? Of course, all sorts of private information are routinely collected in the criminal justice system and then made public in criminal records. Privacy interests are attenuated in that context. We should be asking more searching questions about why such information is collected, what is done with it, and whether such data should be shared with private corporations or made public.
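The kind of nonlinear age dependence at issue can be probed from score data alone, which is essentially how researchers study a black-box instrument from the outside. The sketch below is a minimal illustration using synthetic data, not COMPAS scores, and the decaying functional form is an assumption chosen purely for demonstration: when a score’s dependence on age is curved, allowing curvature in the fitted model sharply reduces residual error relative to a straight-line fit.

```python
import numpy as np

# Hypothetical illustration: the "scores" here are synthetic, not COMPAS data.
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=500)

# Assume (for illustration only) that the unknown score decays
# nonlinearly with age, plus noise.
score = 10 * np.exp(-0.05 * (age - 18)) + rng.normal(0, 0.3, size=500)

# Fit a linear and a quadratic model of score on age.
lin = np.polyfit(age, score, deg=1)
quad = np.polyfit(age, score, deg=2)

# Residual sum of squares for each fit.
rss_lin = np.sum((score - np.polyval(lin, age)) ** 2)
rss_quad = np.sum((score - np.polyval(quad, age)) ** 2)

# A large drop in residual error once curvature is allowed suggests
# the score's dependence on age is not linear.
print(rss_lin > rss_quad)  # prints True
```

Because the quadratic model nests the linear one, its residual error can never be worse; the diagnostic signal is how much better it is, which here is substantial by construction.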
Data entry problems are another important subject of Rudin, Wang, and Coker’s work. They emphasize how, in the context of risk assessment as in many other contexts, the more complex the data to be collected and entered, the more room there is for human error. The authors identify all sorts of mistakes in Broward County risk assessment scoring. They note that “[t]he unnecessary complexity of COMPAS could cause injustice to those who had typos in the COMPAS computation, and as a result, were given extra long sentences.” Because the COMPAS system is not transparent, when such data entry mistakes are made, a judge or a defendant cannot readily correct the error. No one is told that an error has occurred, and if an error is brought to a judge’s attention, the judge cannot know whether it played a role in a risk recommendation. Further, in a pretrial setting in which decisions are made quickly, with limited information and often without meaningful defense representation, by the time an error is caught it may be too late to remedy the harm caused by an improper detention decision.
We also do not know, in Broward County or in other counties in which it is used, to what degree judges adhere to the recommendations made by the risk assessment instrument. In studies of the use of risk assessment in sentencing in Virginia, a different context and one involving a public, nonproprietary instrument, John Monahan, Alex Jakubow, Anne Metz, and I have described how judges often do not follow risk assessment recommendations, where doing so could divert offenders to alternative sentences outside of prison (Garrett, Jakubow, & Monahan, 2019; Garrett & Monahan, in press; Monahan, Metz, & Garrett, 2018). Most often, judges diverted persons from prison to jail, and judges sometimes diverted high- rather than low-risk persons. Some judges described how they were skeptical of quantitative risk information, while others described a lack of available treatment resources in the community. How decision makers like judges interact with risk information is a separate and complicated question. What we can be sure of is that judges do not necessarily hew closely to quantitative information designed to inform their discretion.
To date, the main legal challenge to COMPAS, brought by a criminal defendant in Wisconsin, raised the transparency problem as a due process claim. The discretionary feature of the risk information played an important role in the challenge. The defendant argued that the use of COMPAS violated his due process rights to an individualized sentence and to a sentence not based on inaccurate information, and that the instrument improperly took gender into account. In its Loomis decision in 2016, the Wisconsin Supreme Court rejected the claims, emphasizing that the risk assessment information was only advisory and that judges have discretion to consider or discount the recommendation provided by the instrument (State v. Loomis, 2016). Yet, with the algorithm treated as a trade secret, the judge does not understand how the risk score is calculated any better than the defense does. The court did hold that judges must be given written advisories concerning the risk assessment instrument, including that it relies only on group data, that studies have raised questions about racial bias in its classifications, and that the instrument was not validated in Wisconsin (State v. Loomis, 2016). Future courts will, one hopes, scrutinize COMPAS still more carefully.
There are several levels of transparency needed to assure adequate vetting of algorithms in criminal justice. Fortunately, COMPAS is an outlier in the area of risk assessment, and decision makers seem to be electing to use more public and transparent (and free) options. Most criminogenic risk assessments in use are in fact quite simple, relying largely on static factors (like age and criminal history), and the underlying instruments are made available. In that sense, they may be transparent. The underlying validation data may not be as easily available, however, and, as noted, there can be real questions concerning how risk assessment instruments are used by decision makers in practice. Take, for example, the federal First Step Act, which calls for a risk assessment instrument to be developed and used for all eligible federal prison inmates (Garrett, 2020). The act calls for an independent research committee to develop the instrument, for its design to be made public, and for data concerning its use to be evaluated every 2 years. That act sets a model for a far more transparent approach than that used in many jurisdictions. That said, the validation data used to design the resulting instrument, PATTERN, developed by the research committee, were not made public, such that researchers cannot easily evaluate the claims made regarding how it would function (Garrett & Stevenson, 2019). Thus, many risk assessment instruments are not black boxes or secret, but that does not mean there is not important empirical work to do to better understand and evaluate their use. What may be needed is something more along the lines of the act, to require validation and analysis. Indeed, it would be far preferable if researchers like Rudin, Wang, and Coker received the underlying validation and implementation data for risk assessments, so that private and independent researchers can assess the use of such algorithms.
Further, defense access is critical, so that criminal defendants can meaningfully access evidence, understand it, and litigate it. We need open science for risk assessment.
In contrast to the area of risk assessment, in which most instruments used are made public at least in part, in a wide array of non-risk-assessment areas, algorithms and evidence are used in secret, treated as proprietary, and not disclosed to criminal defendants. For years now, litigants have, mostly without success, sought to obtain data concerning a wide range of evidence used in criminal cases. Algorithms are used to analyze biometric data, such as fingerprints, and the crime labs that rely on these database searches often do not themselves have access to information concerning how those searches are conducted. Complex DNA mixtures are analyzed using algorithms marketed by companies that have not shared source code, except under very narrow circumstances, despite the use of the evidence to support criminal convictions. Each of these forensic uses raises important questions of accuracy, fairness, and the constitutional rights of criminal defendants, and yet these uses of algorithms have largely been permitted without any judicial remedy.
The assumption that transparency and validation will cure unfairness is unfortunately overly optimistic. Indeed, without routine regulatory oversight and defense access, and even with them, there may be no reform in the use of such algorithms. COMPAS has been the subject of inquiry, despite the challenges of studying a black-box instrument, in part because sentencing is a high-profile setting in which algorithms have been used. Similar work is lacking in a wide range of other areas, such as DNA mixtures, facial recognition, and fingerprint databases. Error rates are a serious concern across each of those areas. Error rates have been documented across a wide range of forensic techniques, from DNA testing to fingerprinting. In a criminal trial in New York, the judge ultimately excluded results from two competing DNA mixture algorithms, STRmix and TrueAllele, which reached opposite results (McKinley, 2016).
Errors can occur for a range of reasons: through the use of a technique by a human examiner (Ulery, Hicklin, Buscaglia, & Roberts, 2011), due to the influence of cognitive bias on an examiner (Dror & Hampikian, 2011), from uncertainty inherent in measurement, and through the operation of the algorithms themselves. Further, errors include not only false positive ‘match’ conclusions but also false negative failures to include, and false determinations that evidence is inconclusive or not suitable for examination (Ulery et al., 2011). If error rates are acceptable, the evidence can be used, depending on the purpose to which it is put, and so long as the error rates are appropriately disclosed to jurors and judges. Yet in some areas, error rates have been documented that are so high that scientific bodies have called for an immediate halt to the use of such unreliable evidence in court (President’s Council of Advisors on Science and Technology Report, 2016). For the most part, however, those forensic techniques (such as bite mark comparisons) are still being used in criminal cases and in court.
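The distinctions among these error types can be made concrete with a short sketch. The counts below are invented for illustration; studies such as Ulery et al. (2011) tabulate examiner decisions against known ground truth in essentially this way, computing false positive and false negative rates over conclusive decisions and tracking inconclusive calls separately.

```python
# Hypothetical examiner decisions vs. known ground truth (invented data).
truth    = ["match", "match",        "nonmatch", "nonmatch", "match",    "nonmatch"]
decision = ["match", "inconclusive", "match",    "nonmatch", "nonmatch", "nonmatch"]

# Error rates are conventionally computed over conclusive decisions only.
conclusive = [(t, d) for t, d in zip(truth, decision) if d != "inconclusive"]

# False positive: a "match" declared when the truth is a nonmatch.
fp = sum(1 for t, d in conclusive if t == "nonmatch" and d == "match")
# False negative: a failure to include when the truth is a match.
fn = sum(1 for t, d in conclusive if t == "match" and d == "nonmatch")

fpr = fp / sum(1 for t, d in conclusive if t == "nonmatch")
fnr = fn / sum(1 for t, d in conclusive if t == "match")
inconclusive_rate = decision.count("inconclusive") / len(decision)

print(fpr, fnr, inconclusive_rate)  # 1/3, 0.5, and 1/6 for these invented counts
```

One design point the sketch makes visible: whether inconclusive calls are counted as errors, excluded, or reported separately changes the stated error rate, which is why disclosure of how rates were computed matters to judges and jurors.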
To be sure, private companies can take these concerns into account, and they should be applauded when they do so. In 2019, Axon, the largest maker of police body cameras, convened an independent ethics board to examine whether it should use facial recognition technology. The board recommended against adopting the technology at present, and the company agreed, due to both privacy and accuracy concerns: “Even if face recognition works accurately and equitably—and we stress in detail that at present it does not—the technology makes it far easier for government entities to surveil citizens and potentially intrude into their lives” (Axon AI and Policing Technology Ethics Board, 2019).
Unfortunately, when both government and private companies do not take such considerations into account, there is little in place to prevent a transparent, known-to-be patently unfair, and inaccurate algorithm from being used in criminal cases. The American Statistical Association (2019) recommended that for any type of forensic evidence used in court, the expert “should report the limitations and uncertainty associated with measurements, and the inferences that could be drawn from them.” To date, due process and other legal challenges seeking to ensure that such information is presented have mostly not been successful. The lesson that simple and interpretable risk assessments can be used, rather than an error-prone, complex, and inscrutable risk system, has been lost on many criminal justice decision makers. Perhaps national- and state-level legislation along the lines of the Justice in Forensic Algorithms Act is needed. We need open science in criminal justice.
Read invited commentary by:
Shawn Bushway (The RAND Corporation)
Alexandra Chouldechova (Carnegie Mellon University)
Sarah Desmarais (North Carolina State University)
Eugenie Jackson and Christina Mendoza (Northpointe, Inc.)
Greg Ridgeway (University of Pennsylvania)
Read a rejoinder by: Cynthia Rudin, Caroline Wang, and Beau Coker
American Statistical Association. (2019). Position on statistical statements for forensic evidence. Retrieved from https://www.amstat.org/asa/files/pdfs/POL-ForensicScience.pdf
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica. Retrieved from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Axon AI and Policing Technology Ethics Board. (2019). First report. Retrieved from https://static1.squarespace.com/static/58a33e881b631bc60d4f8b31/t/5d13d7e1990c4f00014c0aeb/1561581540954/Axon_Ethics_Board_First_Report.pdf
Dieterich, W., Mendoza, C., & Brennan, T. (2016). COMPAS Risk Scales: Demonstrating accuracy and predictive parity, performance of the COMPAS Risk Scales in Broward County (pp. 5–6). Northpointe Inc., Research Department.
Dror, I. E., & Hampikian, G. (2011). Subjectivity and bias in forensic DNA mixture interpretation. Science & Justice, 51, 204–208. https://doi.org/10.1016/j.scijus.2011.08.004
Garrett, B. L. (2020). Federal criminal risk assessment. Cardozo Law Review, 41, 121.
Garrett, B. L., Jakubow, A., & Monahan J., (2019). Judicial reliance on risk assessment in sentencing drug and property offenders: A test of the treatment resource hypothesis. Criminal Justice and Behavior, 46, 799–810. https://doi.org/10.1177/0093854819842589
Garrett, B. L., & Monahan J. (in press). Judging risk. California Law Review, 108.
Garrett, B. L., & Stevenson, M. (2019). Comment on PATTERN. Retrieved from https://sites.law.duke.edu/justsciencelab/2019/09/15/comment-on-pattern-by-brandon-l-garrett-megan-t-stevenson/
Justice in Forensic Algorithms Act of 2019, H.R. 4368, 116th Cong. (2019). Retrieved from https://www.congress.gov/bill/116th-congress/house-bill/4368/text?r=2&s=1
McKinley, J. (2016, July 25). Potsdam boy’s murder case may hinge on minuscule DNA sample from fingernail. New York Times.
Monahan, J., Metz, A., & Garrett, B. L. (2018). Judicial appraisals of risk assessment in sentencing. Behavioral Sciences and the Law, 36, 565–575.
President’s Council of Advisors on Science and Technology Report. (2016). Forensic science in criminal courts: Ensuring scientific validity of feature-comparison methods. White House Office of Science and Technology Policy, Washington, D.C.
Rudin, C., Wang, C., & Coker, B. (2020). The age of secrecy and unfairness in recidivism prediction. Harvard Data Science Review, 2(1).
Starr, S. (2014). Evidence-based sentencing and the scientific rationalization of discrimination. Stanford Law Review, 66, 803–872.
State v. Loomis, 881 N.W.2d 749 (Wis. 2016).
Stevenson, M., & Slobogin, C. (2019). Algorithmic risk assessments and the double-edged sword of youth. Behavioral Sciences and the Law, 36, 635–656.
Ulery, B. T., Hicklin, R. A., Buscaglia, J., & Roberts, M. A. (2011). Accuracy and reliability of forensic latent fingerprint decisions. Proceedings of the National Academy of Sciences, 108, 7733–7738. https://doi.org/10.1073/pnas.1018707108
This article is © 2020 by Brandon L. Garrett. The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the author identified above.