Earlier this year, the Food and Brand Lab at Cornell University was caught up in a scandal. The alleged transgression? P-hacking.
If you have no idea what that means, this post is for you.
The papers from our completed studies all report a metric called a “p-value” in their results. A p-value (“p” stands for probability) is a key component of modern statistical analysis. But what exactly is a p-value?
Speaking generally, a p-value is the probability that we would have seen the results we did see (or results even more extreme) if the treated and control groups really were the same.
In other words, we start with the premise that the treated and control groups are the same. That’s our “null hypothesis.” Then we get data. If the data show differences between the treated and control groups, those differences could be due to one of two things: (i) luck, or (ii) the treated and control groups are actually different (meaning our null hypothesis is false). The p-value tells us how likely it is that, if the treated and control groups really were the same, luck would have given us the differences we saw in our data (or differences even more extreme). So if the p-value is high, the differences we saw are the sort that luck alone could easily produce, and we have no evidence that the treated and control groups are different. If the p-value is low, luck alone is an unlikely explanation for the differences we saw, which is evidence that the treated and control groups really are different.
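If you’d like to see this in action, here’s a quick sketch of one common way a p-value is computed, called a permutation test: shuffle the group labels many times and see how often luck alone produces a difference as big as the one observed. (The numbers below are made up for illustration; they aren’t from any of our studies.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements (invented for illustration): outcomes for a
# treated group and a control group.
treated = np.array([23.1, 25.4, 26.0, 22.8, 27.3, 24.9, 25.7, 23.5])
control = np.array([22.0, 23.8, 21.9, 24.1, 22.5, 23.2, 21.7, 22.9])

observed_diff = treated.mean() - control.mean()

# Permutation test: if the null hypothesis is true (the groups are the same),
# the group labels are arbitrary. Shuffle the labels many times and count how
# often luck alone produces a difference at least as extreme as the observed one.
pooled = np.concatenate([treated, control])
n_treated = len(treated)
n_permutations = 10_000

count_as_extreme = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = pooled[:n_treated].mean() - pooled[n_treated:].mean()
    if abs(diff) >= abs(observed_diff):
        count_as_extreme += 1

p_value = count_as_extreme / n_permutations
print(f"observed difference: {observed_diff:.2f}")
print(f"p-value: {p_value:.4f}")
```

If shuffled luck rarely produces a difference as large as the one observed, the p-value comes out small.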
If that’s a p-value, what’s p-hacking? A p-value below .05 is almost a requirement for getting a paper published in a scientific journal. Since publication is essential to academic careers, the p-value can be the key to promotion and accolades, or the barrier to professional success.
Because scientists have some control over how data are collected, analyzed, and reported, those stakes can create a temptation to manipulate the results. P-hacking is the term for analyzing or presenting data in a way that artificially lowers the p-value (by, say, cherry-picking which data to include). Readers have to rely on replication of study results, or, in the absence of follow-up studies, on researcher integrity, to trust the scientific relevance we infer from a low p-value. The combination of p-hacking and the frequent lack of replication means that we’re all vulnerable to being swayed by studies that don’t actually demonstrate a real effect.
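To see how easily this can happen, here’s a toy simulation (again, invented numbers, not drawn from any real study): a “treatment” that does nothing is tested against twenty unrelated outcomes, and only the most “significant” result gets reported.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy illustration of one form of p-hacking: measure many unrelated outcomes
# for the same treated and control groups, then report only the "significant"
# one. Here the treatment truly does nothing -- every outcome is pure noise.
n_outcomes = 20
n_per_group = 30

p_values = []
for _ in range(n_outcomes):
    treated = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    p_values.append(stats.ttest_ind(treated, control).pvalue)

print(f"smallest p-value across {n_outcomes} outcomes: {min(p_values):.3f}")
print(f"outcomes 'significant' at .05: {sum(p < 0.05 for p in p_values)}")
# With 20 independent tests of a treatment that does nothing, there is roughly
# a 64% chance (1 - 0.95**20) that at least one comes in under .05 by luck alone.
```

Reporting only the lucky outcome, and staying quiet about the nineteen others, makes pure noise look like a finding.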
Cornell isn’t the only institution responding to debates around p-values. Over the past several years, many disciplines have found themselves in a replicability crisis: scholars can’t get the same results by repeating studies done by their peers. These failures call into question the validity of the original analyses, including whether the reported p-values are a good measure of how likely the results were to occur by chance. Serious scholars are seeking sustainable solutions. One proposal is to lower the p-value threshold from .05 to .005.
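To get a feel for what that stricter threshold buys, here’s a small simulation (invented data, not from any published analysis) of many studies where the null hypothesis is true, counting how many would clear each cutoff by luck alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulate many studies in which there is truly no effect, then count how many
# would be declared "significant" under each proposed threshold.
n_studies = 10_000
n_per_group = 30

p_values = np.empty(n_studies)
for i in range(n_studies):
    treated = rng.normal(size=n_per_group)
    control = rng.normal(size=n_per_group)
    p_values[i] = stats.ttest_ind(treated, control).pvalue

print(f"false positives at p < .05:  {(p_values < 0.05).mean():.1%}")
print(f"false positives at p < .005: {(p_values < 0.005).mean():.1%}")
# Expect roughly 5% and 0.5% respectively: a stricter threshold makes
# lucky-noise results much rarer, at the cost of demanding stronger evidence.
```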
What’s the best way to improve the replicability of scientific findings? Share your ideas in the comments section.
If you’re looking for a more in-depth description of p-values, NOVA has a great introductory video that explains the history and utility of the p-value, as well as the temptations and dangers of p-hacking. (It’s four minutes with good captions; you can watch it on mute.)