Or How We Should Be Thinking Counterfactually About Actuarial Tools
Nate Silver’s widely heralded FiveThirtyEight.com site now tracks more than just presidential elections. He and his colleagues apply statistical modeling or reasoning to everything from the Emmys to ERAs. Just over a year ago, its contributors, in collaboration with the Marshall Project (which itself is funded by the A2J Lab’s sponsor, the Arnold Foundation), released a feature on the use of pretrial and sentencing risk assessments. So, too, did investigative journalists at ProPublica.
Both pieces raised serious questions about the use of risk scoring mechanisms. Should officials base decisions about individual arrestees or convicted defendants on aggregate data from other cases? Is there any evidence that these tools are racially biased? Former Attorney General Eric Holder previously voiced those concerns, best captured in his statement: “Although these measures were crafted with the best of intentions, I am concerned that they inadvertently undermine our efforts to ensure individualized and equal justice. . . . [T]hey may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society.”
I will address these issues in two posts; here I take up the question of individualized determination in the context of what scholars call the “ecological fallacy.” The concept has nothing to do with who’s right in the climate change debate. Rather, it refers to the practice of drawing inferences about individuals from inferences about the group or class to which those individuals belong. (The concept was famously discussed in, among other cases, the landmark Supreme Court decision McCleskey v. Kemp.) Critics of pretrial risk assessment tools advance a version of the ecological fallacy critique. The gist is that an individual arrestee’s or defendant’s fate turns on conclusions about others who pose ostensibly similar risk, not on this individual’s actual propensity to fail. It’s a generalization seemingly at odds with, as General Holder put it, “individualized and equal justice.”
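To see why the ecological fallacy worries people, consider a toy numerical sketch. The groups, values, and variable names below are entirely invented for illustration; the point is only that a relationship visible in group-level aggregates can be the reverse of the relationship that holds for the individuals inside each group.

```python
# Toy illustration of the ecological fallacy: aggregate (group-level)
# patterns can contradict individual-level patterns. All data invented.
import numpy as np

# Two hypothetical groups; within EACH group, the outcome y falls as x rises.
group_a = np.array([(0, 5), (1, 4), (2, 3)], dtype=float)
group_b = np.array([(4, 6), (5, 5), (6, 4)], dtype=float)

def corr(pairs):
    """Pearson correlation between the x and y columns of a small array."""
    return np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]

print(corr(group_a))  # -1.0: inside group A, x and y move in opposite directions
print(corr(group_b))  # -1.0: same inside group B

# But the group MEANS are (1, 4) and (5, 5): the group with the higher
# average x also has the higher average y, so an aggregate-only analysis
# suggests a POSITIVE relationship -- the opposite of every individual case.
means = np.array([group_a.mean(axis=0), group_b.mean(axis=0)])
print(means)
```

An analyst who saw only the group means would infer the wrong sign for every individual. That is the shape of the critics’ worry, though, as argued below, validated risk tools do not in fact work by handing judges bare group averages.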
The broadsides in the FiveThirtyEight piece that roughly correspond to the ecological fallacy are:
“These instruments aren’t about getting judges to individually analyze life circumstances of a defendant and their particular risk. It’s entirely based on statistical generalizations.”
“Statistics, after all, can’t say whether [a defendant] will commit another crime, and he believes he’s doing everything possible to avoid further run-ins with the law.”
First, and most important, these objections mask a critical question about the role that actuarial risk instruments play in judicial decision-making. Specifically, they fail to consider the counterfactual: what would magistrates and trial court judges do without the risk assessment tool? The report only nods in that direction. (“Formal risk assessments offer greater transparency and, according to numerous studies, greater accuracy than the ad hoc systems they are replacing.”) The point about the counterfactual could have been made much more clearly. Judges who do not receive any guidance must rely on their own generalizations, which can be tainted by cognitive bias and misperception. (See my earlier post for a brief discussion.) Would we seriously prefer not to provide scientifically validated recommendations simply because they derive from large-N statistical studies?
Second, it seems fairly comical to fear risk scores because (1) they can’t “guarantee a probation officer won’t give a kid a higher risk score because he thinks the kid wears his pants too low”; or (2) “[o]fficial records can contain mistakes.” If we really thought counterfactually, we might ask just how bad outcomes could be with (1) no correction for the race and class bias we know exists; or (2) unguided human decision-making that relies on the same error-riddled official records.
Finally, it would make more sense to invoke the ecological fallacy if judges were simply handed generalized statistics about persons whose criminal records (or, more controversially, demographic characteristics) match an arrestee’s. In practice, scientifically backed risk assessment tools use the wealth of knowledge accumulated in past data to draw (hopefully) valid inferences about a particular current case. We do this all the time, whether in graduate admissions testing or experience-rated insurance. Moreover, risk assessment tools provide recommendations that guide or supplement human reasoning. The resulting scores are not designed to supplant such reasoning, and they never should.
I have come here to reframe the criticisms that skeptics of risk score tools have lodged, not to bury them. We at the A2J Lab harbor our own reservations about how courts use these mechanisms. But those worries mostly stem from deploying a risk assessment at a stage of a criminal proceeding to which it was never designed to apply. I also would be troubled by risk scores that use bad inputs, i.e., those that are not scientifically valid. It seems premature and misguided to write off risk assessments for many of the reasons encountered in the FiveThirtyEight feature. Instead, I agree with Jennifer Doleac & Megan Stevenson, who write:
An ideal approach to answering this question would be an experiment in which some judges are randomly assigned to use risk assessments as part of their decisions (their defendants are the treatment group), and some judges to operate as before (their defendants are the control group). . . . We could use [this] method to consider this policy tool’s effects on recidivism, incarceration rates, and any other outcomes we care about.
As it so happens, that’s precisely what we are doing in Dane County, Wisconsin.
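The design Doleac & Stevenson describe can be sketched in a few lines. This is a hypothetical illustration, not the Dane County protocol: the judge names, the 20-judge count, and the even split are all invented. The key feature is that randomization happens at the judge level, and each defendant inherits the arm of the judge who hears the case.

```python
# Minimal sketch of judge-level random assignment for evaluating a risk
# assessment tool. All names and counts are hypothetical.
import random

rng = random.Random(2017)  # fixed seed so the assignment is reproducible

judges = [f"judge_{i:02d}" for i in range(20)]
shuffled = judges[:]
rng.shuffle(shuffled)

treatment = set(shuffled[:10])  # these judges receive the risk score
control = set(shuffled[10:])    # these judges proceed as before

def arm(judge):
    """Experimental arm for any defendant assigned to `judge`."""
    return "treatment" if judge in treatment else "control"
```

Comparing recidivism, incarceration rates, and other outcomes between the two defendant pools then estimates the tool’s causal effect, since which judges (and hence which defendants) see the scores is determined by chance alone.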