Today on Twitter there was some chatter about one-sided p-values. Daniel Lakens thinks that by 2018 we’ll see a renaissance of one-sided p-values due to the advent of preregistration. There was a great conversation that followed Daniel’s tweet, so go click the link above and read it, and we’ll pick this back up once you do.

Okay.

As you have seen, and as is typical of discussions around p-values in general, the question of evidence arises. How do one-sided p-values relate to two-sided p-values as measures of statistical evidence? In this post I will argue that thinking through the logic of one-sided p-values highlights a true illogic of significance testing. This example is largely adapted from Royall’s 1997 book.

### The setup

The idea behind Fisher’s significance tests goes something like this. We have a hypothesis that we wish to find evidence against. If the evidence is strong enough then we can reject this hypothesis. I will use the binomial example because it lends itself to good storytelling, but this works for any test.

Premise A: Say I wish to determine if my coin is unfair. That is, I want to reject the hypothesis, H1, that the probability of heads is *equal to* ½. This is a standard two-sided test. If I flip my coin a few times and observe *x* heads, I can reject H1 (at level α) if the probability of obtaining *x or more* heads is no greater than α/2. If my α is set to the standard level, .05, then I can reject H1 if Pr(*x or more* heads) ≤ .025. In this framework, I have strong evidence that the probability of heads is not equal to ½ if my p-value is no greater than .025. That is, I can claim (at level α) that the probability of heads is either greater than ½ or less than ½ **(proposition A)**.
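The two-sided rule in Premise A can be sketched in plain Python (no external libraries). The data here, 61 heads in 100 flips, are hypothetical numbers chosen for illustration:

```python
from math import comb

def upper_tail_p(x, n, p=0.5):
    """Pr(X >= x) for X ~ Binomial(n, p): the upper-tail probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Hypothetical data: 61 heads in 100 flips of a fair-or-not coin.
n, x = 100, 61
p_tail = upper_tail_p(x, n)

# Two-sided test of H1 at alpha = .05: reject if the tail probability
# is no greater than alpha/2 = .025.
reject_two_sided = p_tail <= 0.025
```

With 61 heads the upper-tail probability falls below .025, so this two-sided test rejects H1.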

Premise B: If I have some reason to think the coin might be biased one way or the other, say there is a kid on the block with a coin biased to come up heads more often than not, then I might want to use a one-sided test. In this test, the hypothesis to be rejected, H2, is that the probability of heads is *less than or equal to* ½. In this case I can reject H2 (at level α) if the probability of obtaining *x or more* heads is less than α. If my α is set to the standard level again, .05, then I can reject H2 if Pr(*x or more* heads) < .05. Now I have strong evidence that the probability of heads is not equal to ½, nor is it less than ½, if my p-value is less than .05. That is, I can claim (again at level α) that the probability of heads is greater than ½ **(proposition B)**.
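The one-sided rule in Premise B is the same calculation against the full α. Again the data, 59 heads in 100 flips, are hypothetical:

```python
from math import comb

def upper_tail_p(x, n, p=0.5):
    """Pr(X >= x) for X ~ Binomial(n, p): the upper-tail probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Hypothetical data: 59 heads in 100 flips.
n, x = 100, 59
p_tail = upper_tail_p(x, n)

# One-sided test of H2 at alpha = .05: reject if the tail probability
# is less than alpha itself.
reject_one_sided = p_tail < 0.05
```

With 59 heads the upper-tail probability is below .05, so the one-sided test rejects H2.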

As you can see, proposition B is a stronger logical claim than proposition A. Saying that my car is faster than your car is making a stronger claim than saying that my car is *either* faster *or* slower than your car.

### The paradox

If I obtain a result *x* such that α/2 < Pr(*x or more* heads) < α (e.g., .025 < p < .05), then I have strong evidence for the conclusion that the probability of heads is greater than ½ (see proposition B). But at the same time I do not have strong evidence for the conclusion that the probability of heads is > ½ *or* < ½ (see proposition A).
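To put a concrete (hypothetical) number on the paradox region: with 100 flips, 59 heads lands the tail probability between .025 and .05, so the one-sided test rejects while the two-sided test cannot:

```python
from math import comb

def upper_tail_p(x, n, p=0.5):
    """Pr(X >= x) for X ~ Binomial(n, p): the upper-tail probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

p_tail = upper_tail_p(59, 100)       # falls in the paradox region (.025, .05)

one_sided_rejects = p_tail < 0.05    # True: conclude Pr(heads) > 1/2
two_sided_rejects = p_tail <= 0.025  # False: cannot conclude "> 1/2 or < 1/2"
```

Exactly the situation described above: the stronger conclusion is licensed, the weaker one is not.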

I have defied the rules of logic. I have concluded the stronger proposition, probability of heads > ½, but I cannot conclude the weaker proposition, probability of heads > ½ *or* < ½. As Royall (1997, p. 77) would say, if the evidence justifies the conclusion that the probability of heads is greater than ½ then surely it justifies the weaker conclusion that the probability of heads is either > ½ or < ½.

### Should we use one-sided p-values?

Go ahead, I can’t stop you. But be aware that if you try to interpret p-values, either one- or two-sided, as measures of *statistical (logical) evidence* then you may find yourself in a p-value paradox.

### References and further reading:

Royall, R. (1997). *Statistical evidence: A likelihood paradigm* (Vol. 71). CRC Press. Chapter 3.7.

One-sided p-values are definitely tricky. You allocate zero probability to results in the non-predicted direction, so merely having a prediction or expectation of a directional effect is not sufficient to justify their use. You have to be certain a priori that a result in the non-predicted direction should be ignored or discounted. This rarely makes sense, even with preregistration. This is another area where Bayesian approaches seem more appealing: since you are in essence placing a prior on the non-predicted results (and generally a fairly implausible one), it makes sense to use Bayesian machinery for this.

Abelson (1995) proposed a one-and-a-half-tailed test, which is more reasonable: tail probabilities of 1% and 4% in the non-predicted and predicted directions …

Hi Alex,

I think you might be interested in my resolution of the so-called paradox posed by Royall and others, here: https://www.onesided.org/articles/the-paradox-of-one-sided-v-two-sided-tests-of-significance.php . I first take a stab at the generic form and then specifically address Royall’s formulation of it under “Solving Royall’s version of the paradox”. The subsection is pretty much a contained argument so no need to read the whole article if the rest is of no interest.

Would love to hear your thoughts on it!

Thanks,

Georgi