How Should We Think about Racial Disparities?
In a previous post, I considered some of the less convincing critiques of pretrial and sentencing risk assessments, namely those that sound in the ecological fallacy. That argument mistakenly faults risk scores for applying only group-level inferences to individual case decision-making. The takeaway was straightforward. A comprehensive understanding of actuarial tools must include rigorous counterfactual thinking about a state of the world in which they aren’t available. In this follow-up, I discuss an even more serious claim: that actuarial tools might lead to unjustifiable racial disparities in criminal justice outcomes.
The ProPublica piece to which I linked before focuses on the troubling implications of racial imbalances in scores and predictive accuracy. The article’s opening vignettes compare a black teenage girl who stole a bicycle with a middle-aged white man who stole hardware from a Home Depot. Importantly, he had prior armed robbery convictions, whereas she had no record. The proprietary scoring algorithm known as COMPAS deemed the young girl a high-risk individual and her older counterpart a low-risk one. And yet: “Two years later, we know the computer algorithm got it exactly backward. [The girl] has not been charged with any new crimes. [The man] is serving an eight-year prison term for subsequently breaking into a warehouse and stealing thousands of dollars’ worth of electronics.” (emphasis added) Errors of this sort–what statisticians call Type I and Type II errors, respectively–deserve further scrutiny. But two vignettes are, after all, merely anecdata.
Four researchers set out to crunch the numbers and published both an academic study and the general-audience piece. Using data from Broward County, FL, their initial analysis showed that black and white defendants received different COMPAS scores, even after adjusting for other defendant characteristics. A number of reasons, varying in legitimacy, might explain the differential. For example, COMPAS might rely on inputs that themselves are tainted by systemic racism. But a score differential alone does not establish that the risk assessment is biased. Substantiating that claim requires comparing recidivism outcomes by race, conditional on the risk score assigned. For example, if white and black defendants assigned a moderate risk score recommit crimes at the same rate, we would say the mechanism does not introduce racial bias.
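The conditional comparison described above can be sketched in a few lines of Python. The records, field names, and rates below are fabricated for illustration only; they are not drawn from the COMPAS data.

```python
# A minimal sketch of a calibration check: compare observed recidivism
# rates by race *within* each risk-score level. All records below are
# made up for illustration; the field names are assumptions, not
# COMPAS's actual schema.
from collections import defaultdict

records = [
    # (race, score label, recidivated?)
    ("black", "low", 0), ("black", "low", 0), ("black", "low", 1),
    ("black", "high", 1), ("black", "high", 1), ("black", "high", 1),
    ("white", "low", 0), ("white", "low", 0), ("white", "low", 0),
    ("white", "high", 1), ("white", "high", 0), ("white", "high", 1),
]

totals = defaultdict(int)
failures = defaultdict(int)
for race, score, recidivated in records:
    totals[(score, race)] += 1
    failures[(score, race)] += recidivated

# Recidivism rate in each (score, race) cell. If the rates within a
# score level are roughly equal across races, the score is calibrated
# by race -- even if the score *distributions* differ between groups.
rates = {key: failures[key] / totals[key] for key in totals}
for (score, race), rate in sorted(rates.items()):
    print(f"{score:>4} / {race}: {rate:.2f}")
```

On real data one would also want uncertainty estimates for each cell, but the basic test is just this group-by-and-compare.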
The study authors then turn to these more relevant conclusions. They observed that COMPAS “correctly predicted an offender’s recidivism 61 percent of the time, but was only correct in its predictions of violent recidivism 20 percent of the time” and that it “correctly predicted recidivism for black and white defendants at roughly the same rate (59 percent for white defendants, and 63 percent for black defendants) but made mistakes in very different ways. It misclassifies the white and black defendants differently when examined over a two-year follow-up period.” Results based on simple comparison tests by race were robust to covariates such as criminal history, age, and gender. But are they robust to more exacting interpretations?
Take the ProPublica contingency tables. These numbers ostensibly reveal the extent of Type I and Type II errors in a meaningful way. The mismatch story ensues because black defendants who did not reoffend were much more likely to have been labeled high-risk than white defendants who did not reoffend; the opposite was true for low-risk designations among those who did reoffend. But as Jennifer Doleac and Megan Stevenson point out, these frequency tabulations carry only limited value. Why? If underlying recidivism rates differ by race, then these ratios necessarily will be misleading.
The reason is purely mathematical. As Doleac & Stevenson remind us, “[i]n a group with high recidivism rates, the numerator will be larger because the pool of people labeled high risk is bigger and the denominator will be smaller because there are fewer people who do not reoffend. The result is that the ratio of these numbers is always larger than it is for low-recidivism groups.” (emphasis added) Much like the inherent problems with reporting odds ratios in empirical work, reporting false positive/negative rates can obscure the issue and impede interpretation. Christopher Lowenkamp and his colleagues adopted a better approach. In part, they ran regressions using variables that reflect precisely the test I mentioned above: “interaction terms between an individual’s race and the [COMPAS] decile score.” Using this framework, Lowenkamp et al. found “no significant differences in . . . relationship between the COMPAS and general recidivism for White and Black defendants. A given COMPAS score translates into roughly the same likelihood of recidivism, whether a defendant is Black or White.” (emphasis added)
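The arithmetic Doleac and Stevenson describe is easy to verify with toy numbers. In the sketch below, the risk label is identically calibrated for two hypothetical groups: 60 percent of those labeled high-risk reoffend and 20 percent of those labeled low-risk reoffend, in both groups. Only the share of each group labeled high-risk differs, yet the false positive rates diverge sharply. All figures are invented for illustration.

```python
# Toy illustration: a score with *identical calibration* across two
# groups still yields different false positive rates whenever the
# groups' underlying recidivism rates differ. Numbers are made up.

def false_positive_rate(n, share_high, p_reoffend_high, p_reoffend_low):
    """FPR = P(labeled high-risk | did not reoffend), from cell counts."""
    high = n * share_high            # count labeled high-risk
    low = n - high                   # count labeled low-risk
    fp = high * (1 - p_reoffend_high)  # high-risk label, no reoffense
    tn = low * (1 - p_reoffend_low)    # low-risk label, no reoffense
    return fp / (fp + tn)

# Same calibration in both groups (60% of high-risk labels reoffend,
# 20% of low-risk labels reoffend); only the labeled shares differ.
fpr_a = false_positive_rate(100, 0.7, 0.6, 0.2)  # higher-recidivism group
fpr_b = false_positive_rate(100, 0.3, 0.6, 0.2)  # lower-recidivism group

print(f"higher-recidivism group FPR: {fpr_a:.2f}")  # ~0.54
print(f"lower-recidivism group FPR:  {fpr_b:.2f}")  # ~0.18
```

Just as the quoted passage predicts, the higher-recidivism group’s non-reoffender pool (the denominator) shrinks while its pool of high-risk labels (feeding the numerator) grows, so its false positive rate is mechanically larger even though the score means exactly the same thing for both groups.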
I do not mean to suggest–far from it–that we should ignore differential sentencing outcomes by race. (The same is true for all prior procedural decisions in a criminal case.) It is important, though, to distinguish differential outcomes (e.g., different risk assessment scores) by race from bias on account of race. The two are not necessarily equivalent. In addition, measuring bias itself depends on the counterfactual reference point that the researcher or policymaker identifies for comparative purposes. As with the need for counterfactual thinking, puzzling through the relationship between race and risk assessments raises tough questions. The Lab will continue to ask them through our PSA field work and hopefully generate useful evidence in response.