# Sunday Bayes: Testing precise hypotheses

First and foremost, when testing precise hypotheses, formal use of P-values should be abandoned. Almost anything will give a better indication of the evidence provided by the data against Ho.

### Sunday Bayes series intro:

After the great response to the eight easy steps paper we posted, I started a recurring series, where each week I highlight one of the papers that we included in the appendix of the paper. The format is short and simple: I will give a quick summary of the paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere. At the end of the post I will list a few suggestions for the next entry, so vote in the comments or on twitter (@alxetz) for which one you’d like next. This paper was voted to be the next in the series.

(I changed the series name to Sunday Bayes, since I’ll be posting these every Sunday.)

### Testing precise hypotheses

This would indicate that say, claiming that a P-value of .05 is significant evidence against a precise hypothesis is sheer folly; the actual Bayes factor may well be near 1, and the posterior probability of Ho near 1/2 (p. 326)

Berger and Delampady (pdf link) review the background and standard practice for testing point null hypotheses (i.e., “precise hypotheses”). The paper came out nearly 30 years ago, so some parts of the discussion may not be as relevant these days, but it’s still a good paper.

They start by reviewing the basic measures of evidence (p-values, Bayes factors, posterior probabilities) before turning to an example. Rereading it, I remember why we gave this paper one of the highest difficulty ratings in the eight steps paper. There is a lot of technical discussion in this paper, but luckily I think most of the technical bits can be skipped in favor of reading their commentary.

One of the main points of this paper is to investigate precisely when it is appropriate to approximate a small interval null hypothesis by using a point null hypothesis. They conclude that, most of the time, the error of approximation for Bayes factors will be small (<10%):

these numbers suggest that the point null approximation to Ho will be reasonable so long as [the width of the null interval] is one-half a [standard error] in width or smaller. (p. 322)

A secondary point of this paper is to refute the claim that classical answers will typically agree with some “objective” Bayesian analyses. Their conclusion is that such a claim

is simply not the case in the testing of precise hypotheses. This is indicated in Table 1 where, for instance, P(Ho | x) [NB: the posterior probability of the null] is from 5 to 50 times larger than the P-value. (p. 318)
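The posterior probability in that excerpt is tied to the Bayes factor through a standard identity, which I find helpful to keep in view. The symbols here are my own notation rather than the paper's: $\pi_0$ is the prior probability of the null and $B_{01}$ is the Bayes factor in favor of the null.

```latex
P(H_0 \mid x) = \left[ 1 + \frac{1 - \pi_0}{\pi_0} \cdot \frac{1}{B_{01}} \right]^{-1}
```

With equal prior odds ($\pi_0 = 1/2$) and $B_{01} = 1$ this gives $P(H_0 \mid x) = 1/2$, which is the "Bayes factor near 1, posterior probability near 1/2" situation the authors warn about.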

They also review some lower bounds on the amount of Bayesian evidence that corresponds to significant p-values. They sum up their results thusly,

The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)

and

the weighted likelihood of H1 is at most [2.5] times that of Ho. A likelihood ratio [NB: Bayes factor] of [2.5] is not particularly strong evidence, particularly when it is [an upper] bound. However, it is customary in practice to view [p] = .05 as strong evidence against Ho. A P-value of [p] = .01, often considered very strong evidence against Ho, corresponds to [BF] = .1227, indicating that H1 is at most 8 times as likely as Ho. The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)
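If you want to see where numbers of this size can come from, here is a minimal Python sketch. It rests on assumptions of mine that are not stated in the excerpt: a two-sided z test with known standard error, and a lower bound taken over uniform priors centered on the null value. The function name `bf_lower_bound` and the simple grid search are hypothetical conveniences, not the authors' method.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def bf_lower_bound(p_value, grid_step=0.001, max_half_width=10.0):
    """Smallest Bayes factor for H0 over Uniform(-k, k) priors on the effect.

    Assumption: a two-sided z test, so the p-value maps to a single |z|,
    and the effect is measured in standard-error units.
    """
    z = STD_NORMAL.inv_cdf(1 - p_value / 2)
    null_density = STD_NORMAL.pdf(z)  # density of the data under the point null

    # Search over prior half-widths k for the alternative that fits the data best.
    best_alt_density = 0.0
    steps = int(max_half_width / grid_step)
    for i in range(1, steps + 1):
        k = i * grid_step
        # marginal density of z when the effect is Uniform(-k, k)
        alt = (STD_NORMAL.cdf(k - z) - STD_NORMAL.cdf(-k - z)) / (2 * k)
        best_alt_density = max(best_alt_density, alt)

    return null_density / best_alt_density


for p in (0.05, 0.01):
    bound = bf_lower_bound(p)
    print(f"p = {p}: BF for H0 is at least {bound:.3f} (H1 favored by at most {1 / bound:.1f} to 1)")
```

Under those assumptions the bound comes out near 0.41 for p = .05 and near 0.12 for p = .01, in line with the "at most 2.5 times" and "at most 8 times" figures in the excerpt.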

### A few choice quotes

Page 319:

[A common opinion is that if] θ0 [NB: a point null] is not in [a confidence interval] it can be rejected, and looking at the set will provide a good indication as to the actual magnitude of the difference between θ and θ0. This opinion is wrong, because it ignores the supposed special nature of θ0. A point can be outside a 95% confidence set, yet not be so strongly contraindicated by the data. Only by calculating a Bayes factor … can one judge how well the data supports a distinguished point θ0.

Page 327:

Of course, every statistician must judge for himself or herself how often precise hypotheses actually occur in practice. At the very least, however, we would argue that all types of tests should be able to be properly analyzed by statistics

Page 327 (emphasis original, since that text is a subheading):

[It is commonly argued that] The P-Value Is Just a Data Summary, Which We Can Learn To Properly Calibrate … One can argue that, through experience, one can learn how to interpret P-values. … But if the interpretation depends on Ho, the sample size, the density and the stopping rule, all in crucial ways, it becomes ridiculous to argue that we can intuitively learn to properly calibrate P-values.

Page 328:

we would urge reporting both the Bayes factor, B, against [H0] and a confidence or credible region, C. The Bayes factor communicates the evidence in the data against [H0], and C indicates the magnitude of the possible discrepancy.

Page 328:

Without explicit alternatives, however, no Bayes factor or posterior probability could be calculated. Thus, the argument goes, one has no recourse but to use the P-value. A number of Bayesian responses to this argument have been raised … here we concentrate on responding in terms of the discussion in this paper. If, indeed, it is the case that P-values for precise hypotheses essentially always drastically overstate the actual evidence against Ho when the alternatives are known, how can one argue that no problem exists when the alternatives are not known?

### Vote for the next entry:

1. Edwards, Lindman, and Savage (1963) — Bayesian Statistical Inference for Psychological Research (pdf)
2. Rouder (2014) — Optional Stopping: No Problem for Bayesians (pdf)
3. Gallistel (2009) — The Importance of Proving the Null (pdf)
4. Lindley (2000) — The philosophy of statistics (pdf)

# Type-S and Type-M errors

An anonymous reader of the blog emailed me:

I wonder if you’d be ok to help me to understand this Gelman’s graph. I struggle to understand what is the plotted distribution and the exact meaning of the red area. Of course I read the related article, but it doesn’t help me much.
Rather than write a long-winded email, I figured it will be easier to explain on the blog using some step by step illustrations. With the anonymous reader’s permission I am sharing the question and this explanation for all to read. The graph in question is reproduced below. I will walk through my explanation by building up to this plot piecewise with the information we have about the specific situation referenced in the related paper. The paper, written by Andrew Gelman and John Carlin, illustrates the concepts of Type-M errors and Type-S errors. From the paper:
We frame our calculations not in terms of Type 1 and Type 2 errors but rather Type S (sign) and Type M (magnitude) errors, which relate to the probability that claims with confidence have the wrong sign or are far in magnitude from underlying effect sizes (p. 2)
So Gelman’s graph is an attempt to illustrate these types of errors. I won’t go into the details of the paper since you can read it yourself! I was asked to explain this graph though, which isn’t in the paper, so we’ll go through step by step building our own type-s/m graph in order to build an understanding. The key idea is this: if the underlying true population mean is small and sampling error is large, then experiments that achieve statistical significance must have exaggerated effect sizes and are likely to have the wrong sign. The graph in question:
A few technical details: Here Gelman is plotting a sampling distribution for a hypothetical experiment. If one were to repeatedly take a sample from a population, then each sample mean would be different from the true population mean by some amount due to random variation. When we run an experiment, we essentially pick a sample mean from this distribution at random. Picking at random, sample means tend to be near the true mean of the population, and how much these random sample means vary follows a curve like this. The height of the curve represents the relative frequency for a sample mean in a series of random picks. Obtaining sample means far away from the true mean is relatively rare since the height of the curve is much lower the farther out we go from the population mean. The red shaded areas indicate values of sample means that achieve statistical significance (i.e., exceed some critical value).
The distribution’s form is determined by two parameters: a location parameter and a scale parameter. The location parameter is simply the mean of the distribution (μ), and the scale parameter is the standard deviation of the distribution (σ). In this graph, Gelman defines the true population mean to be 2 based on his experience in this research area; the standard deviation is equal to the sampling error (standard error) of our procedure, which in this case is approximately 8.1 (estimated from empirical data; for more information see the paper, p. 6). The extent of variation in sample means is determined by the amount of sampling error present in our experiment. If measurements are noisy, or if the sample is small, or both, then sampling error goes up. This is reflected in a wider sampling distribution. If we can refine our measurements, or increase our sample size, then sampling error goes down and we see a narrower sampling distribution (smaller value of σ).

### Let’s build our own Type-S and Type-M graph

In Gelman’s graph the mean of the population is 2, and this is indicated by the vertical blue line at the peak of the curve. Again, this hypothetical true value is determined by Gelman’s experience with the topic area. The null hypothesis states that the true mean of the population is zero, and this is indicated by the red vertical line. The hypothetical sample mean from Gelman’s paper is 17, which I’ve added as a small grey diamond near the x-axis. R code to make all figures is provided at the end of this post (except the gif).
If we assume that the true population mean is actually zero (indicated by the red vertical line), instead of 2, then the sampling distribution has a location parameter of 0 and a scale parameter of 8.1. This distribution is shown below. The diamond representing our sample mean corresponds to a fairly low height on the curve, indicating that it is relatively rare to obtain such a result under this sampling distribution.
Next we need to define cutoffs for statistically significant effects (the red shaded areas under the curve in Gelman’s plot) using the null value combined with the sampling error of our procedure. Since this is a two-sided test using an alpha of 5%, we have one cutoff for significance at approximately -15.9 (i.e., 0 – [1.96 x 8.1]) and the other cutoff at approximately 15.9 (i.e., 0 + [1.96 x 8.1]). Under the null sampling distribution, the shaded areas are symmetrical. If we obtain a sample mean that lies beyond these cutoffs we declare our result statistically significant by conventional standards. As you can see, the diamond representing our sample mean of 17 is just beyond this cutoff and thus achieves statistical significance.
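The cutoff arithmetic is easy to check for yourself. Here is a minimal Python sketch (the original figures were made in R; the variable names below are mine) that uses the same normal approximation as the text.

```python
from statistics import NormalDist

STANDARD_ERROR = 8.1   # sampling error quoted in the post
NULL_MEAN = 0.0        # population mean under the null hypothesis
ALPHA = 0.05           # two-sided significance level
SAMPLE_MEAN = 17.0     # the hypothetical observed sample mean

# Sampling distribution of the sample mean if the null were true
null_dist = NormalDist(mu=NULL_MEAN, sigma=STANDARD_ERROR)

# Two-sided cutoffs: 2.5% of the null distribution lies beyond each one
lower_cutoff = null_dist.inv_cdf(ALPHA / 2)
upper_cutoff = null_dist.inv_cdf(1 - ALPHA / 2)

# Two-sided p-value for the observed sample mean
p_value = 2 * (1 - null_dist.cdf(abs(SAMPLE_MEAN)))
is_significant = abs(SAMPLE_MEAN) > upper_cutoff

print(f"cutoffs: {lower_cutoff:.1f} and {upper_cutoff:.1f}")
print(f"two-sided p for a sample mean of 17: {p_value:.3f}")
print(f"significant at the 5% level: {is_significant}")
```

The cutoffs land at roughly plus and minus 15.9, and a sample mean of 17 clears the upper one by a small margin.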
But Gelman’s graph assumes the population mean is actually 2, not zero. This is important because we can’t actually have a sign error or a magnitude error if there isn’t a true sign or magnitude. We can adjust the curve so that the peak is above 2 by shifting it over slightly to the right. The shaded areas begin in the same place on the x-axis as before (+/- 15.9), but notice that they have become asymmetrical. This is due to the fact that we shifted the entire distribution slightly to the right, shrinking the left shaded area and expanding the right shaded area.
And there we have our own beautiful type-s and type-m graph. Since the true population mean is small and positive, any sample mean falling in the left tail has the wrong sign and vastly overestimates the population mean (-15.9 vs. 2). Any sample mean falling in the right tail has the correct sign, but again vastly overestimates the population mean (15.9 vs. 2). Our sample mean falls squarely in the right shaded tail. Since the standard error of this procedure (8.1) is much larger than the true population mean (2), any statistically significant result must have a sample mean that is much larger in magnitude than the true population mean, and is quite likely to have the wrong sign.
In this case the left tail contains 24% of the total shaded area under the curve, so in repeated sampling a full 24% of significant results will be in the wrong tail (and thus be a sign error). If the true population mean were still positive but larger in magnitude then the shaded area in the left tail would become smaller and smaller, as it did when we shifted the true population mean from zero to 2, and thus sign errors would be less of a problem. As Gelman and Carlin summarize,
setting the true effect size to 2% and the standard error of measurement to 8.1%, the power comes out to 0.06, the Type S error probability is 24%, and the expected exaggeration factor is 9.7. Thus, it is quite likely that a study designed in this way would lead to an estimate that is in the wrong direction, and if “significant,” it is likely to be a huge overestimate of the pattern in the population. (p. 6)
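Those three numbers can be approximated directly from the shaded areas. The sketch below is my own simplification, not the calculation from the paper: it treats the sampling distribution as exactly normal and uses the closed-form mean of a truncated normal for the exaggeration factor, so small differences from the published figures are to be expected.

```python
from statistics import NormalDist

TRUE_MEAN = 2.0                  # hypothesized true population mean
STANDARD_ERROR = 8.1             # sampling error of the procedure
CUTOFF = 1.96 * STANDARD_ERROR   # significance threshold, about 15.9

sampling = NormalDist(mu=TRUE_MEAN, sigma=STANDARD_ERROR)
std_normal = NormalDist()

# Shaded areas: probability of a significant result in each tail
right_tail = 1 - sampling.cdf(CUTOFF)
left_tail = sampling.cdf(-CUTOFF)

power = right_tail + left_tail   # chance of any significant result
type_s = left_tail / power       # share of significant results with the wrong sign

# Expected |sample mean| among significant results, via truncated normal means
a = (CUTOFF - TRUE_MEAN) / STANDARD_ERROR
b = (-CUTOFF - TRUE_MEAN) / STANDARD_ERROR
right_part = TRUE_MEAN * right_tail + STANDARD_ERROR * std_normal.pdf(a)
left_part = -TRUE_MEAN * left_tail + STANDARD_ERROR * std_normal.pdf(b)
exaggeration = (right_part + left_part) / power / TRUE_MEAN

print(f"power: {power:.3f}")
print(f"Type S error rate: {type_s:.2f}")
print(f"exaggeration factor: {exaggeration:.1f}")
```

Power comes out near 0.06 and the Type S rate near 24%, matching the quote. The exaggeration factor from this normal approximation lands between 9 and 10, in the same neighborhood as the 9.7 reported in the paper.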
Here is a neat gif showing our progression! Thanks for reading 🙂

(I don’t think this disclaimer is needed but here it goes: I don’t think people should actually use repeated-sampling statistical inference. This is simply an explanation of the concept. Be a Bayesian!)

### R code


# Edwards, Lindman, and Savage (1963) on why the p-value is still so dominant

Below is an excerpt from Edwards, Lindman, and Savage (1963, pp. 236-7), on why p-value procedures continue to be dominant in the empirical sciences even after the p-value has been repeatedly shown to be an incoherent and nonsensical statistic (note: those are my choice of words; the authors are very cordial in their commentary). The age of the article shows in numbers 1 and 2, but I think it is still valuable commentary; numbers 3 and 4 are still highly relevant today.

From Edwards, Lindman, and Savage (1963, pp. 236-7):

If classical significance tests have rather frequently rejected true null hypotheses without real evidence, why have they survived so long and so dominated certain empirical sciences? Four remarks seem to shed some light on this important and difficult question.

1. In principle, many of the rejections at the .05 level are based on values of the test statistic far beyond the borderline, and so correspond to almost unequivocal evidence [i.e., passing the interocular trauma test]. In practice, this argument loses much of its force. It has become customary to reject a null hypothesis at the highest significance level among the magic values, .05, .01, and .001, which the test statistic permits, rather than to choose a significance level in advance and reject all hypotheses whose test statistics fall beyond the criterion value specified by the chosen significance level. So a .05 level rejection today usually means that the test statistic was significant at the .05 level but not at the .01 level. Still, a test statistic which falls just short of the .01 level may correspond to much stronger evidence against a null hypothesis than one barely significant at the .05 level. …

2. Important rejections at the .05 or .01 levels based on test statistics which would not have been significant at higher levels are not common. Psychologists tend to run relatively large experiments, and to get very highly significant main effects. The place where .05 level rejections are most common is in testing interactions in analyses of variance—and few experimenters take those tests very seriously, unless several lines of evidence point to the same conclusions. [emphasis added]

3. Attempts to replicate a result are rather rare, so few null hypothesis rejections are subjected to an empirical check. When such a check is performed and fails, explanation of the anomaly almost always centers on experimental design, minor variations in technique, and so forth, rather than on the meaning of the statistical procedures used in the original study.

4. Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data […] Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. […] The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses. [emphasis added]

References

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193-242.

# Are all significance tests made of the same stuff?

No! If you are like most of the sane researchers out there, you don’t spend your days and nights worrying about the nuances of different statistical concepts. Especially ones as traditional as these. But there is one concept that I think we should all be aware of: P-values mean very different things to different people. Richard Royall (1997, p. 76-7) provides a smattering of different possible interpretations and fleshes out the arguments for why these mixed interpretations are problematic (much of this post comes from his book):

In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we will say that the data on which the test is based do not provide sufficient evidence to cause rejection. (Daniel, 1991, p. 192)

A nonsignificant result does not prove that the null hypothesis is correct — merely that it is tenable — our data do not give adequate grounds for rejecting it. (Snedecor and Cochran, 1980, p. 66)

The verdict does not depend on how much more readily some other hypothesis would explain the data. We do not even start to take that question seriously until we have rejected the null hypothesis. … The statistical significance level is a statement about evidence… If it is small enough, say p = 0.001, we infer that the result is not readily explained as a chance outcome if the null hypothesis is true and we start to look for an alternative explanation with considerable assurance. (Murphy, 1985, p. 120)

If [the p-value] is small, we have two explanations — a rare event has happened, or the assumed distribution is wrong. This is the essence of the significance test argument. Not to reject the null hypothesis … means only that it is accepted for the moment on a provisional basis. (Watson, 1983)

Test of hypothesis. A procedure whereby the truth or falseness of the tested hypothesis is investigated by examining a value of the test statistic computed from a sample and then deciding to reject or accept the tested hypothesis according to whether the value falls into the critical region or acceptance region, respectively. (Remington and Schork, 1970, p. 200)

Although a ‘significant’ departure provides some degree of evidence against a null hypothesis, it is important to realize that a ‘nonsignificant’ departure does not provide positive evidence in favour of that hypothesis. The situation is rather that we have failed to find strong evidence against the null hypothesis. (Armitage and Berry, 1987, p. 96)

If that value [of the test statistic] is in the region of rejection, the decision is to reject H0; if that value is outside the region of rejection, the decision is that H0 cannot be rejected at the chosen level of significance … The reasoning behind this decision process is very simple. If the probability associated with the occurrence under the null hypothesis of a particular value in the sampling distribution is very small, we may explain the actual occurrence of that value in two ways; first we may explain it by deciding that the null hypothesis is false or, second, we may explain it by deciding that a rare and unlikely event has occurred. (Siegel and Castellan, 1988, Chapter 2)

These all mix and match three distinct viewpoints with regard to hypothesis tests: 1) Neyman-Pearson decision procedures, 2) Fisher’s p-value significance tests, and 3) Fisher’s rejection trials (I think 2 and 3 are sufficiently different to be considered separately). Mixing and matching them is inappropriate, as will be shown below. Unfortunately, they all use the same terms so this can get confusing! I’ll do my best to keep things simple.

1. Neyman-Pearson (NP) decision procedure:
Neyman describes it thusly:

The problem of testing a statistical hypothesis occurs when circumstances force us to make a choice between two courses of action: either take step A or take step B… (Neyman 1950, p. 258)

…any rule R prescribing that we take action A when the sample point … falls within a specified category of points, and that we take action B in all other cases, is a test of a statistical hypothesis. (Neyman 1950, p. 258)

The terms ‘accepting’ and ‘rejecting’ a statistical hypothesis are very convenient and well established. It is important, however, to keep their exact meaning in mind and to discard various additional implications which may be suggested by intuition. Thus, to accept a hypothesis H means only to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true. Also if the application … ‘rejects’ H, this means only that the rule prescribes action B and does not imply that we believe that H is false. (Neyman 1950, p. 259)

So what do we take from this? NP testing is about making a decision to choose H0 or H1, not about shedding light on the truth of any one hypothesis or another. We calculate a test statistic, see where it lies with regard to our predefined rejection regions, and make the corresponding decision. We can ensure that we are not often wrong by defining Type I and Type II error probabilities (α and β) to be used in our decision procedure. According to this framework, a good test is one that minimizes these long-run error probabilities. It is important to note that this procedure cannot tell us anything about the truth of hypotheses and does not provide us with a measure of evidence of any kind, only a decision to be made according to our criteria. This procedure is notably symmetric; that is, we can choose either H0 or H1.

Test results would look like this:

“α and β were prespecified (based on relevant costs associated with the different errors) for this situation at yadda yadda yadda. The test statistic (say, t=2.5) falls inside the rejection region for H0 defined as t>2.0 so we reject H0 and accept H1.” (Alternatively, you might see “p < α = x so we reject H0.” The exact value of p is irrelevant; it is either inside or outside of the rejection region defined by α. Obtaining a p = .04 is effectively equivalent to p = .001 for this procedure, as is obtaining a result very much larger than the critical t above.)
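To make the "only the region matters" point concrete, here is a tiny Python sketch of an NP-style decision rule. The function name `np_decision` is hypothetical, and the critical value of 2.0 is just the one from the example report above.

```python
def np_decision(t_statistic, critical_value=2.0):
    """Return the action prescribed by a rejection region of t > critical_value.

    The output depends only on which side of the cutoff the statistic falls,
    never on how far past the cutoff it is.
    """
    if t_statistic > critical_value:
        return "reject H0, accept H1"
    return "accept H0"


for t in (1.5, 2.1, 2.5, 6.0):
    print(f"t = {t}: {np_decision(t)}")
```

A t of 2.1 and a t of 6.0 lead to exactly the same action, which is why the procedure gives a guide to behavior rather than a graded measure of evidence.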

2. Fisher’s p-value significance tests

Fisher’s first procedure is only ever concerned with one hypothesis, that being the null. This procedure is not concerned with making decisions (and when in science do we actually ever do that anyway?) but with measuring evidence against the hypothesis. We want to evaluate ‘the strength of evidence against the hypothesis’ (Fisher, 1958, p. 80) by evaluating how rare our particular result (or even bigger results) would be if there were really no effect in the study. Our objective here is to calculate a single number that Fisher called the level of significance, or the p-value. Smaller p is more evidence against the hypothesis than larger p. Increasing levels of significance* are often represented** by more asterisks*** in tables or graphs. More asterisks mean lower p-values, and presumably more evidence against the null.

What is the rationale behind this test? There are only two possible interpretations of our low p: either a rare event has occurred, or the underlying hypothesis is false. Fisher doesn’t think the former is reasonable, so we should assume the latter (Bakan, 1966).

Note that this procedure is directly trying to measure the truth value of a hypothesis. Lower ps indicate more evidence against the hypothesis. This is based on the Law of Improbability, that is,

Law of Improbability: If hypothesis A implies that the probability that a random variable X takes on the value x is quite small, say p(x), then the observation X = x is evidence against A, and the smaller p(x), the stronger the evidence. (Royall, 1997, p. 65)

In a future post I will attempt to show why this law is not a valid indicator of evidence. For the purpose of this post we just need to understand the logic behind this test and that it is fundamentally different from NP procedures. This test alone does not provide any guidance with regard to taking action or making a decision; it is intended as a measure of evidence against a hypothesis.

Test results would look like this:

The present results obtain a t value of 2.5, which corresponds to an observed p = .01**. This level of significance is very small and indicates quite strong evidence against the hypothesis of no difference.

3. Fisher’s rejection trials

This is a strange twist on both of the other procedures above, taking elements from each to form a rejection trial. This test is a decision procedure, much like NP procedures, but with only one explicitly defined hypothesis, a la p-value significance tests. The test is most like what psychologists actually use today, framed as two possible decisions, again like NP, but now they are framed in terms of only one hypothesis. Rejection regions are back too, defined as a region of values that have small probability under H0 (i.e., defined by a small α). It is framed as a problem of logic, specifically,

…a process analogous to testing a proposition in formal logic via the argument known as modus tollens, or ‘denying the consequent’: if A implies B, then not-B implies not-A. We can test A by determining whether B is true. If B is false, then we conclude that A is false. But, on the other hand, if B is found to be true we cannot conclude that A is true. That is, A can be proven false by such a test but it cannot be proven true — either we disprove A or we fail to disprove it…. When B is found to be true, so that A survives the test, this result, although not proving A, does seem intuitively to be evidence supporting A. (Royall, 1997, p. 72)

An important caveat is that these tests are probabilistic in nature, so the logical implications aren’t quite right. Nevertheless, rejection trials are what Fisher referred to when he famously said,

Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis… The notion of an error of the so-called ‘second kind,’ due to accepting the null hypothesis ‘when it is false’ … has no meaning with reference to simple tests of significance. (Fisher, 1966)

So there is a major difference from NP — With rejection trials you have a single hypothesis (as opposed to 2) combined with decision rules of “reject the H0 or do not reject H0” (as opposed to reject H0/H1 or accept H0/H1). With rejection trials we are back to making a decision. This test is asymmetric (as opposed to NP which is symmetric) — that is, we can only ever reject H0, never accept it.

While we are making decisions with rejection trials, the decisions have a different meaning than that of NP procedures. In this framework, deciding to reject H0 implies the hypothesis is “inconsistent with the data” or that the data “provide sufficient evidence to cause rejection” of the hypothesis (Royall, 1997, p.74). So rejection trials are intended to be both decision procedures and measures of evidence. Test statistics that fall into smaller α regions are considered stronger evidence, much the same way that a smaller p-value indicates more evidence against the hypothesis. For NP procedures α is simply a property of the test, and choosing a lower one has no evidential meaning per se (although see Mayo, 1996 for a 4th significance procedure — severity testing).

Test results would look like this:

The present results obtain a t = 2.5, p = .01, which is sufficiently strong evidence against H0 to warrant its rejection.

What is the takeaway?

If you aren’t aware of the difference between the three types of hypothesis testing procedures, you’ll find yourself jumbling them all up (Gigerenzer, 2004). If you aren’t careful, you may end up thinking you have a measure of evidence when you actually have a guide to action.

Which one is correct?

Funny enough, I don’t endorse any of them. I contend that p-values never measure evidence (in either p-value procedures or rejection trials) and NP procedures lead to absurdities that I can’t in good faith accept while simultaneously endorsing them.

Why write 2000 words clarifying the nuanced differences between three procedures I think are patently worthless? Well, did you see what I said at the top referring to sane researchers?

A future post is coming that will explicate the criticisms of each procedure, many of the points again coming from Royall’s book.

References

Armitage, P., & Berry, G. (1987). Statistical methods in medical research. Oxford: Blackwell Scientific.

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66(6), 423.

Daniel, W. W. (1991). Hypothesis testing. Biostatistics: a foundation for analysis in the health sciences, 5, 191.

Fisher, R. A. (1958). Statistical methods for research workers (13th ed.). New York: Hafner.

Fisher, R. A. (1966). The design of experiments (8th ed.). Oliver and Boyd.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Murphy, E. A. (1985). A companion to medical statistics. Johns Hopkins University Press.

Neyman, J. (1950). First course in probability and statistics. Henry Holt.

Remington, R. D., & Schork, M. A. (1970). Statistics with applications to the biological and health sciences.

Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC press.

Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioural sciences. New York: McGraw-Hill.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods. Ames: Iowa State University Press.

Watson, G. S. (1983). Hypothesis testing. Encyclopedia of Statistics in Quality and Reliability.

# Practice Makes Perfect (p<.05)

What’s wrong with [null-hypothesis significance testing]? Well… it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! (Cohen 1994, pg 997)

That quote was written by Jacob Cohen in 1994. What does it mean? Let’s start from the top.

A null-hypothesis significance test (NHST) is a statistical test in which one wishes to test a research hypothesis. For example, say I hypothesize that practicing improves performance (makes you faster) when building a specific Lego set. So I go out and collect some data to see how much people improve on average from a pretest to a posttest: one group with no practice (control group) and another group with practice (experimental group). I end up finding that people improve by five minutes when they practice and they don’t improve when they don’t practice. That seems to support my hypothesis that practice leads to improvement!

Typically, however, in my field (psychology) one does not simply test a research hypothesis directly. First one sets up a null-hypothesis (i.e., H0, typically the opposite of the real hypothesis: e.g., no effect, no difference between means, etc.) and collects data trying to show that the null-hypothesis isn’t true. To test my hypothesis using NHST, I would first have to imagine that I’m in a fictitious world where practicing on this measure doesn’t actually improve performance (H0 = no difference in improvement between groups). Then I calculate the probability of finding results at least as extreme as the ones I found. If the chance of finding results at least as extreme as mine is less than 5%, we reject the null-hypothesis and say it is unlikely to be true.

In other words, I calculate the probability of finding a difference in improvement between groups of at least 5 minutes on my Lego building task (remember, in a world where practicing doesn’t make you better and the groups’ improvements aren’t different), and I find that my probability (p-value) is 1%. Wow! That’s pretty good. Definitely less than 5%, so I can reject the null-hypothesis of no improvement when people practice.

But what do I really learn from a significance test? A p-value only tells me the chance that I should find data like mine in a hypothetical world, a world that I don’t think is true, and I don’t want to be true. Then when I find data that seem unlikely in a world where H0 is true, I conclude that it likely isn’t true. The logic of the argument is thus:

If H0 is true, then this result (statistical significance) would probably not occur.

This result has occurred.

Then H0 is probably not true [….] (Cohen, 1994 pg 998)

So: if it’s unlikely to find data like mine in a world where H0 is true, then it is unlikely that the null-hypothesis is true. What we want to say is how likely our null-hypothesis is, given our data. That’s inverse reasoning though. We don’t have any information about the probability of H0; we just did an experiment where we pretended that it was true! How can our results from a world in which H0 is true provide evidence that it isn’t true? It’s already assumed to be true in our calculations! We only make the decision to reject H0 because one day we arbitrarily decided that our cut-off was 5%, and anything smaller than that means we don’t believe H0 is true.

Maybe this will make it more clear why that reasoning is bad:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)

This person is a member of Congress.

Therefore, he is probably not an American. (ibid)

That’s the same logical structure that the null-hypothesis test takes. Obviously incoherent when we put it like that, right?

This problem arises because we want to say “it is unlikely that the null-hypothesis is true,” but what we really say with a p-value is, “it is unlikely to find data this extreme when the null-hypothesis is true.” Those are very different statements. One gives the probability of a hypothesis given a data set, P(Hypothesis | Data), and the other gives the probability of data given a hypothesis, P(Data | Hypothesis). No matter how much we wish for it to be true, the two probabilities are not the same. They’re never going to be the same. P-values will never tell us what we want them to tell us. We should stop pretending they do and we should acknowledge the limited inferential ability of our NHST.
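Cohen’s Congress example makes the gap between the two conditional probabilities easy to see with rough numbers. In this small Python sketch the population figure is a round number I am assuming purely for illustration, and the variable names are mine.

```python
US_POPULATION = 330_000_000   # assumed round figure, for illustration only
MEMBERS_OF_CONGRESS = 535     # 100 senators plus 435 representatives

# P(member of Congress | American): tiny, so "probably not a member" is fair
p_congress_given_american = MEMBERS_OF_CONGRESS / US_POPULATION

# P(American | member of Congress): assuming every member is an American
p_american_given_congress = 1.0

print(f"P(Congress | American) = {p_congress_given_american:.7f}")
print(f"P(American | Congress) = {p_american_given_congress:.1f}")
```

The first probability is on the order of one in a million while the second is 1, so swapping the direction of the conditional flips the conclusion entirely. That is the same swap we make when we read a p-value as the probability that H0 is true.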

Thanks for reading, comment if you’d like.