Recently, we became aware of at least two blog posts (see here and here) lifting passages from, and selectively highlighting a result of, a draft paper the five of us authored. These posts give a distorted picture of what our paper said and did. The paper is not intended as an analysis of the Public Safety Assessment-Decision Making Framework (“PSA-DMF”) System risk assessment, nor are the data the paper analyzes anywhere close to final. Rather, the paper proposes new statistical methodology. For illustration purposes, it uses partial data made available to us in the middle of a still-ongoing field experiment of the PSA-DMF System; that illustration draws on less than 20% (in a rough sense) of the information the study will ultimately produce. An interim report on the study, which has been public for some time now, makes all of this clear. Moreover, the paper’s illustrative application produced many results, only one of which the blog posts highlight.
We are scientists. We want our work, both our new methodologies and our applied findings, to inform public debates and decision making. But it is not helpful to mistake illustrative applications, based on partial interim data and appearing in a paper intended to propose new statistical methods, for final study results, nor to selectively highlight only certain results of an overall analysis. We hope this post will clarify mistaken impressions.
Here are some details.
Our paper proposes new statistical methodology for evaluating risk assessment instruments and uses, as its applied example, interim data from a still-ongoing randomized controlled trial (“RCT”). The RCT evaluates the use of a predisposition risk assessment instrument called the Public Safety Assessment (“PSA”) and the accompanying, jurisdiction-specific, Decision Making Framework (“DMF”) in Dane County, WI. Arnold Ventures supported the development of the PSA-DMF System.
The two blog posts make much of one of the several results stemming from the illustration of our new methodology, that is, from the application of our methods to interim data from the Dane County study. The results of our illustration are many and varied. Here is a sampling of these varied results as applied to the interim data from this one RCT:
- The availability of the PSA-DMF System had no statistically significant effect on the prevalence of predisposition new criminal activity (“NCA”) in any of three conceptually defined classes of individuals appearing at a first appearance hearing.
- The availability of the PSA-DMF System had no statistically significant effect on the prevalence of predisposition new violent criminal activity (“NVCA”) in any of three conceptually defined classes of individuals appearing at a first appearance hearing.
- The availability of the PSA-DMF System had no statistically significant effect on the prevalence of predisposition failure to appear (“FTA”) in any of three conceptually defined classes of individuals appearing at a first appearance hearing.
- The availability of the PSA-DMF System had no statistically significant effect on the measure of the racial fairness of the Dane County judges (actually, “Commissioners” in Dane) that two authors of our paper proposed in separate work, called “principal fairness.”
- The availability of the PSA-DMF System had a statistically significant effect on the principal fairness measure with respect to gender comparisons, in that it increased the strictness of Commissioner decisions for men while decreasing the corresponding strictness for women, thus widening somewhat an already-existing gender difference in those decisions.
The two blog posts seize upon the last of these results (ignoring almost all of the others) to mount an attack on the PSA-DMF System and on the use of algorithms or risk classification instruments in criminal justice more generally.
Advocates for a particular position often simplify, distort, and selectively quote. In a democracy such as ours, committed to free expression of ideas, it is not a mistake for them to do so. The mistake is for anyone else to pay attention. Good decision making in a democracy requires readers to distinguish between what is worthy of attention and belief and what is not.
One indication of whether a report on research deserves attention is whether the report’s authors contacted the relevant researchers before publishing, to request comment or clarification about the nature of the research. To our knowledge, neither blog post’s authors did so here. Had they done so, we would have been happy to bring certain facts to their attention.

First, the paper they quote is about statistical methodology, as a cursory review of it reveals. It is quite nerdy. The “results” in the paper are intended to illustrate how the statistical methodology works on a dataset. They are not intended to form the basis of conclusions relevant for policy. That is why, for example, we made no attempt in the paper to adjust for the fact that we conducted multiple tests on the same data (a brief sketch of what such an adjustment typically involves appears below). Including an illustrative application in a paper proposing new statistical techniques is traditional in the field, and again, the results of an illustration are not intended to form the basis of policy making.

Second, as a lengthy report and a more accessible FAQ sheet (both publicly available on the website of the Access to Justice Lab, where two of us work) make clear, the data used for the paper were an incomplete subset of about 20% of the data the Dane County RCT will eventually produce. Moreover, the Dane County RCT is one of five field RCT studies the A2J Lab has underway (a sixth RCT field operation, due to IRB restrictions, will produce only a technical report). In a rough, hand-wavy sense, we are talking about one-fifth of the data from one-sixth of the studies underway in this area.

Third, as the report and the FAQ sheet both discuss, based on the data produced from the Dane County study at this time, the availability of the PSA-DMF System had no statistically significant effect on the number of predisposition days that individuals appearing at a first appearance hearing spent incarcerated. That suggests (although does not prove) that any disparity in the strictness or permissiveness of Commissioner decisions (which is what the blog posts focus on) did not translate into a difference in predisposition jail time.
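As an aside for statistically minded readers, here is a minimal sketch of the kind of multiple-testing adjustment mentioned above. It is purely illustrative: the p-values are invented for this example and are not figures from our paper or from the Dane County study. The sketch simply assumes a handful of hypothesis tests run on the same data and applies a standard Holm correction using the statsmodels library.

```python
# Illustrative only: a standard Holm adjustment for multiple hypothesis tests.
# The p-values below are invented for this sketch; they are NOT results from
# our paper or from the Dane County study.
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from several tests run on the same dataset
raw_p_values = [0.012, 0.210, 0.034, 0.450, 0.049]

# Holm's step-down procedure controls the family-wise error rate at alpha = 0.05
reject, adjusted_p, _, _ = multipletests(raw_p_values, alpha=0.05, method="holm")

for raw, adj, rej in zip(raw_p_values, adjusted_p, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.3f}  significant after adjustment: {rej}")
```

The point of the sketch is simply that a p-value which looks “significant” on its own can fail to survive this kind of adjustment, which is one reason unadjusted results used to illustrate a statistical method should not be read as policy findings.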
Stepping back, it is not accurate to say that any statistically significant difference between two groups immediately amounts to actionable discrimination. Even treating the illustrative gender result on which the two blog posts focus as the final word (it isn’t), from the only field RCT (it isn’t), on the only risk assessment instrument used in criminal law (not even close), one should be cautious. Gender disparities have long been present in criminal justice, with women generally being arrested less frequently than men and receiving more lenient treatment from the court system. Saying that gender disparities exist, or even increase, because of some intervention requires careful thought about whether we want to do anything about those disparities, and if so, what. In this case, suppose the presence of the PSA-DMF System increased gender disparities by increasing the leniency of the criminal justice system’s treatment of women vis-à-vis similarly situated men (as appears to be partially the case from our illustrative result). One way the criminal justice system could solve that “problem” would be to treat women more like men, i.e., more strictly and harshly. Is that what we want?
Good science is slow, sometimes maddeningly so. Credible research takes time. Useful inference requires careful attention to context, and policy decisions require the weighing of alternatives. It is not helpful to any of these processes for those with a political ax to grind to present results designed to illustrate the application of new statistical methods, results based on a subset of preliminary data from one of several ongoing studies, as though those results were the last word on anything, including that one study.
One final thought: The Access to Justice Lab, where two of us work and which conducted the Dane County field operation, is supported by Arnold Ventures, as is the Dane County study itself. That said, all five of us are agnostic at this stage about whether any criminal justice risk assessment instrument, including the PSA-DMF System, is a good or a bad thing. In our view, credible evidence one way or the other does not yet exist. What little does exist suggests to us that the Sturm und Drang about risk assessments, from proponents and opponents alike, may be overblown. Moreover, what one thinks about whether it is a good or bad idea to use risk assessment instruments of any kind should turn in large part on what one would do if risk assessments are not used. The most common alternative to the use of risk assessments, in criminal justice at least, is unguided, loosely guided, or less guided human decision making. And those opposed to the use of risk assessments in criminal justice apparently prefer these unguided human decisions. The United States has had decades of experience with unguided human decision making in its criminal justice systems. How has that gone?
James Greiner
Ryan Halen
Kosuke Imai
Zhichao Jiang
Sooahn Shin
* Note: authors are listed in alphabetical order by last name