What’s wrong with [null-hypothesis significance testing]? Well… it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! (Cohen 1994, pg 997)

That quote was written by Jacob Cohen in 1994.What does it mean? Let’s start from the top.

A null-hypothesis significance test (NHST) is a statistical test in which one wishes to test a research hypothesis. For example, say I hypothesize that practicing improves performance (makes you faster) when building a specific lego set. So I go out and collect some data to see how much people improve on average from a pretest to a post test- one group with no practice (control group) and another group with practice (experimental group). I end up finding that people improve by five minutes when they practice and they don’t improve when they don’t practice. That seems to support my hypothesis that practice leads to improvement!

Typically, however, in my field (psychology) one does not simply test their research hypothesis directly, first one sets up a null-hypothesis (i.e., H0, typically the opposite of their real hypothesis: e.g., no effect, no difference between means, etc.) and collects data trying to show that the null-hypothesis isn’t true. To test my hypothesis using NHST, I would first have to imagine that I’m in a fictitious world where practicing on this measure doesn’t actually improve performance (H0 = no difference in improvement between groups). Then I calculate the likelihood of finding results at least as extreme as the ones i found. If the chance of finding results at least as extreme as mine is less than 5%, we reject the null-hypothesis and say it is unlikely to be true.

In other words, I calculate the probability of finding a difference of improvement between groups of at least 5 minutes on my lego building task- remember, in a world where practicing doesn’t make you better and the groups improvements aren’t different- and I find that my probability (p-value) is 1%. Wow! That’s pretty good. Definitely less than 5% so I can reject the null-hypothesis of no improvement when people practice.

But what do I really learn from a significance test? A p-value only tells me the chance that I should find data like mine in a hypothetical world, a world that I don’t think is true, and I don’t want to be true. Then when I find data that seem unlikely in a world where H0 is true, I conclude that it likely isn’t true. The logic of the argument is thus:

If H0 is true, then this result (statistical significance) would probably not occur.

This result has occurred.

Then H0 is probably not true [….] (Cohen, 1994 pg 998)

So: if it’s unlikely to find data like mine in a world where H0 is true, then it is unlikely that the null-hypothesis is true. We want to say is how likely our null-hypothesis is by looking at our data. That’s inverse reasoning though. We don’t have any information about the likelihood of H0, we just did an experiment where we* pretended that it was true*! How can our results from a world in which H0 is true provide evidence that it isn’t true? It’s already assumed to be true in our calculations! We only make the decision to reject H0 because one day we arbitrarily decided that our cut-off was 5%, and anything smaller than that means we don’t believe H0 true.

Maybe this will make it more clear why that reasoning is bad:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)

This person is a member of Congress.

Therefore, he is probably not an American. (ibid)

That’s the same logical structure that the null-hypothesis test takes. Obviously incoherent when we put it like that right?

This problem arises because we want to say “it is unlikely that the null-hypothesis is true,” but what we really say with a p-value is, “it is unlikely to *find this extreme of data* when the null-hypothesis is true.” Those are very different statements. One gives a likelihood of a hypothesis given a data set, P( Hypothesis | Data) and the other gives a likelihood of data given a hypothesis, P( Data | Hypothesis). No matter how much we wish for it to be true, the two probabilities are not the same. They’re never going to be the same. P-values will never tell us what we want them to tell us. We should stop pretending they do and we should acknowledge the limited inferential ability of our NHST.

Thanks for reading, comment if you’d like.