Sunday Bayes: Testing precise hypotheses

First and foremost, when testing precise hypotheses, formal use of P-values should be abandoned. Almost anything will give a better indication of the evidence provided by the data against Ho.

–Berger & Delampady, 1987 (pdf link)

Sunday Bayes series intro:

After the great response to the eight easy steps paper we posted, I started a recurring series, where each week I highlight one of the papers that we included in the appendix of the paper. The format is short and simple: I will give a quick summary of the paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere. At the end of the post I will list a few suggestions for the next entry, so vote in the comments or on twitter (@alxetz) for which one you’d like next. This paper was voted to be the next in the series.

(I changed the series name to Sunday Bayes, since I’ll be posting these on every Sunday.)

Testing precise hypotheses

This would indicate that say, claiming that a P-value of .05 is significant evidence against a precise hypothesis is sheer folly; the actual Bayes factor may well be near 1, and the posterior probability of Ho near 1/2 (p. 326)

Berger and Delampady (pdf link) review the background and standard practice for testing point null hypotheses (i.e., “precise hypotheses”). The paper came out nearly 30 years ago, so some parts of the discussion may not be as relevant these days, but it’s still a good paper.

They start by reviewing the basic measures of evidence — p-values, Bayes factors, posterior probabilities — before turning to an example. Rereading it, I remember why we gave this paper one of the highest difficulty ratings in the eight steps paper. There is a lot of technical discussion in this paper, but luckily I think most of the technical bits can be skipped in lieu of reading their commentary.

One of the main points of this paper is to investigate precisely when it is appropriate to approximate a small interval null hypothesis by using a point null hypothesis. They conclude, that most of the time, the error of approximation for Bayes factors will be small (<10%),

these numbers suggest that the point null approximation to Ho will be reasonable so long as [the width of the null interval] is one-half a [standard error] in width or smaller. (p. 322)

A secondary point of this paper is to refute the claim that classical answers will typically agree with some “objective” Bayesian analyses. Their conclusion is that such a claim

is simply not the case in the testing of precise hypotheses. This is indicated in Table 1 where, for instance, P(Ho | x) [NB: the posterior probability of the null] is from 5 to 50 times larger than the P-value. (p. 318)

They also review some lower bounds on the amount of Bayesian evidence that corresponds to significant p-values. They sum up their results thusly,

The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)


the weighted likelihood of H1 is at most [2.5] times that of Ho. A likelihood ratio [NB: Bayes factor] of [2.5] is not particularly strong evidence, particularly when it is [an upper] bound. However, it is customary in practice to view [p] = .05 as strong evidence against Ho. A P-value of [p] = .01, often considered very strong evidence against Ho, corresponds to [BF] = .1227, indicating that H1 is at most 8 times as likely as Ho. The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)

A few choice quotes

Page 319:

[A common opinion is that if] θ0 [NB: a point null] is not in [a confidence interval] it can be rejected, and looking at the set will provide a good indication as to the actual magnitude of the difference between θ and θ0. This opinion is wrong, because it ignores the supposed special nature of θo. A point can be outside a 95% confidence set, yet not be so strongly contraindicated by the data. Only by calculating a Bayes factor … can one judge how well the data supports a distinguished point θ0.

Page 327:

Of course, every statistician must judge for himself or herself how often precise hypotheses actually occur in practice. At the very least, however, we would argue that all types of tests should be able to be properly analyzed by statistics

Page 327 (emphasis original, since that text is a subheading):

[It is commonly argued that] The P-Value Is Just a Data Summary, Which We Can Learn To Properly Calibrate … One can argue that, through experience, one can learn how to interpret P-values. … But if the interpretation depends on Ho, the sample size, the density and the stopping rule, all in crucial ways, it becomes ridiculous to argue that we can intuitively learn to properly calibrate P-values.

page 328:

we would urge reporting both the Bayes factor, B, against [H0] and a confidence or credible region, C. The Bayes factor communicates the evidence in the data against [H0], and C indicates the magnitude of the possible discrepancy.

Page 328:

Without explicit alternatives, however, no Bayes factor or posterior probability could be calculated. Thus, the argument goes, one has no recourse but to use the P-value. A number of Bayesian responses to this argument have been raised … here we concentrate on responding in terms of the discussion in this paper. If, indeed, it is the case that P-values for precise hypotheses essentially always drastically overstate the actual evidence against Ho when the alternatives are known, how can one argue that no problem exists when the alternatives are not known?

Vote for the next entry:

  1. Edwards, Lindman, and Savage (1963) — Bayesian Statistical Inference for Psychological Research (pdf)
  2. Rouder (2014) — Optional Stopping: No Problem for Bayesians (pdf)
  3. Gallistel (2009) — The Importance of Proving the Null (pdf)
  4. Lindley (2000) — The philosophy of statistics (pdf)

The general public has no idea what “statistically significant” means

The title of this piece shouldn’t shock anyone who has taken an introductory statistics course. Statistics is full of terms that have a specific statistical meaning apart from their everyday meaning. A few examples:

Significant, confidence, power, random, mean, normal, credible, moment, bias, interaction, likelihood, error, loadings, weights, hazard, risk, bootstrap, information, jack-knife, kernel, reliable, validity; and that’s just the tip of the iceberg. (Of course, one’s list gets bigger the more statistics courses one takes.)

It should come as no surprise that the general public mistakes a term’s statistical meaning for its general english meaning when nearly every word has some sort of dual-meaning.

Philip Tromovitch (2015) has recently put out a neat paper in which he surveyed a little over 1,000 members of the general public on their understanding of the meaning of “significant,” a term which has a very precise statistical definition: assuming the null hypothesis is true (usually defined as no effect), discrepancies as large or larger than this result would be so rare that we should act as if the null hypothesis isn’t true and we won’t often be wrong.

However, in everyday english, something that is significant means that it is noteworthy or worth our attention. Rather than give a cliched dictionary definition, I asked my mother what she thought. She says she would interpret a phrase such as, “there was a significant drop in sales from 2013 to 2014” to indicate that the drop in sales was “pretty big, like quite important.” (thanks mom 🙂 ) But that’s only one person. What did Tromovitch’s survey respondents think?

Tromovitch surveyed a total of 1103 people. He asked 611 of his respondents to answer this multiple choice question, and the rest answered a variant as an open ended question. Here is the multiple choice question to his survey respondents:

When scientists declare that the finding in their study is “significant,” which of the following do you suspect is closest to what they are saying:

  • the finding is large
  • the finding is important
  • the finding is different than would be expected by chance
  • the finding was unexpected
  • the finding is highly accurate
  • the finding is based on a large sample of data

Respondents choosing the first two responses were considered to be incorrectly using general english, choosing the third answer was considered correct, and choosing any of the final three were considered other incorrect answer. He separated general public responses from those with doctorate degrees (n=15), but he didn’t get any information on what topic their degree was in, so I’ll just refer to the rest of the sample’s results from here on since the doctorate sample should really be taken with a grain of salt.

Roughly 50% of respondents gave a general english interpretation of the “significant” results (options 1 or 2), roughly 40% chose one of the other three wrong responses (options 4, 5, or 6), and less than 10% actually chose the correct answer (option 3). Even if they were totally guessing you’d expect them to get close to 17% correct (1/6), give or take.

But perhaps multiple choice format isn’t the best way to get at this, since the prompt itself provides many answers that sound perfectly reasonable. Tromovitch also asked this as an open-ended question to see what kind of responses people would generate themselves. One variant of the prompt explicitly mentions that he wants to know about statistical significance, while the other simply mentions significance. The exact wording was this:

Scientists sometimes conclude that the finding in their study is “[statistically] significant.” If you were updating a dictionary of modern American English, how would you define the term “[statistically] significant”?

Did respondents do any better when they can answer freely? Not at all. Neither prompt had a very high success rate; they had correct response rates at roughly 4% and 1%. This translates to literally 12 correct answers out of the total 492 respondents of both prompts combined (including phd responses). Tromovitch includes all of these responses in the appendix so you can read the kinds of answers that were given and considered to be correct.

If you take a look at the responses you’ll see that most of them imply some statement about the probability of one hypothesis or the other being true, which isn’t allowed by the correct definition of statistical significance! For example, one answer coded as correct said, “The likelihood that the result/findings are not due to chance and probably true” is blatantly incorrect. The probability that the results are not due to chance is not what statistical significance tells you at all. Most of the responses coded as “correct” by Tromovitch are quite vague, so it’s not clear that even those correct responders have a good handle on the concept. No wonder the general public looks at statistics as if they’re some hand-wavy magic. They don’t get it at all.

snape 2

My takeaway from this study is the title of this piece: the general public has no idea what statistical significance means. That’s not surprising when you consider that researchers themselves often don’t know what it means! Even professors who teach research methods and statistics get this wrong. Results from Haller & Krauss (2002), building off of Oakes (1986), suggest that it is normal for students, academic researchers, and even methodology instructors to endorse incorrect interpretations of p-values and significance tests. That’s pretty bad. It’s one thing for first-year students or the lay public to be confused, but educated academics and methodology instructors too? If you don’t buy the survey results, open up any journal issue in any psychology journal and you’ll find tons of examples of misinterpretation and confusion.

Recently Hoekstra, Morey, Rouder, & Wagenmakers (2014) demonstrated that confidence intervals are similarly misinterpreted by researchers, despite recent calls (Cumming, 2014) to totally abandon significance tests in lieu of confidence intervals. Perhaps we could toss out the whole lot and start over with something that actually makes sense? Maybe we could try teaching something that people can actually understand?

I’ve heard of this cool thing called Bayesian statistics we could try.



Cumming, G. (2014). The new statistics: Why and how. Psychological Science25(1), 7-29.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1-20.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Wiley.

Tromovitch, P. (2015). The lay public’s misinterpretation of the meaning of ‘significant’: A call for simple yet significant changes in scientific reporting. Journal of Research Practice, 1(1), 1.

Edwards, Lindman, and Savage (1963) on why the p-value is still so dominant

Below is an excerpt from Edwards, Lindman, and Savage (1963, pp. 236-7), on why p-value procedures continue to be dominant in the empirical sciences even after it has been repeatedly shown to be an incoherent and nonsensical statistic (note: those are my choice of words, the authors are very cordial in their commentary). The age of the article shows in numbers 1 and 2, but I think it is still valuable commentary; Numbers 3 and 4 are still highly relevant today.

From Edwards, Lindman, and Savage (1963, pp. 236-7):

If classical significance tests have rather frequently rejected true null hypotheses without real evidence, why have they survived so long and so dominated certain empirical sciences ? Four remarks seem to shed some light on this important and difficult question.

1. In principle, many of the rejections at the .05 level are based on values of the test statistic far beyond the borderline, and so correspond to almost unequivocal evidence [i.e., passing the interocular trauma test]. In practice, this argument loses much of its force. It has become customary to reject a null hypothesis at the highest significance level among the magic values, .05, .01, and .001, which the test statistic permits, rather than to choose a significance level in advance and reject all hypotheses whose test statistics fall beyond the criterion value specified by the chosen significance level. So a .05 level rejection today usually means that the test statistic was significant at the .05 level but not at the .01 level. Still, a test statistic which falls just short of the .01 level may correspond to much stronger evidence against a null hypothesis than one barely significant at the .05 level. …

2. Important rejections at the .05 or .01 levels based on test statistics which would not have been significant at higher levels are not common. Psychologists tend to run relatively large experiments, and to get very highly significant main effects. The place where .05 level rejections are most common is in testing interactions in analyses of variance—and few experimenters take those tests very seriously, unless several lines of evidence point to the same conclusions. [emphasis added]

3. Attempts to replicate a result are rather rare, so few null hypothesis rejections are subjected to an empirical check. When such a check is performed and fails, explanation of the anomaly almost always centers on experimental design, minor variations in technique, and so forth, rather than on the meaning of the statistical procedures used in the original study.

4. Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data […] Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. […]The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses. [emphasis added]



Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological review, 70(3), 193-242.

Question: Why do we settle for 80% power? Answer: We’re confused.

Coming back to the topic of my previous post, about how we must draw distinct conclusions from different hypothesis test procedures, I’d like to show an example of how these confusions might actually arise in practice. The following example comes from Royall’s book (you really should read it), and questions why we settle for a power of only 80%. It’s a question we’ve probably all thought about at some point. Isn’t 80% power just as arbitrary as p-value thresholds? And why should we settle for such a large probability of error before we even start an experiment?

From Royall (1997, pp. 109-110):

Why is a power of only 0.80 OK?

We begin with a mild peculiarity — why is it that the Type I error rate α is ordinarily required to be 0.05 or 0.01, but a Type II error rate as large as 0.20 is regularly adopted? This often occurs when the sample size for a clinical trial is being determined. In trials that compare a new treatment to an old one, the ‘null’ hypothesis usually states that the new treatment is not better than the old, while the alternative states that it is. The specific alternative value chosen might be suggested by pilot studies or uncontrolled trials that preceded the experiment that is now being planned, and the sample size is determined [by calculating power] with α = 0.05 and β = 0.20. Why is such a large value of β acceptable? Why the severe asymmetry in favor of α? Sometimes, of course, a Type I error would be much more costly than a Type II error would be (e.g. if the new treatment is much more expensive, or if it entails greater discomfort). But sometimes the opposite is true, and we never see studies proposed with α = 0.20 and β = 0.05. No one is satisfied to report that ‘the new treatment is statistically significantly better than the old (p ≤ 0.20)’.

Often the sample-size calculation is first made with β = α = 0.05. But in that case experimenters are usually quite disappointed to see what large values of n are required, especially in trials with binomial (success/failure) outcomes. They next set their sights a bit lower, with α = 0.05 and β = 0.10, and find that n is still ‘too large’. Finally they settle for α = 0.05 and β = 0.20.

Why do they not adjust α and settle for α = 0.20 and β = 0.05? Why is small α a non-negotiable demand, while small β is only a flexible desideratum? A large α would seem to be scientifically unacceptable, indicating a lack of rigor, while a large β is merely undesirable, an unfortunate but sometimes unavoidable consequence of the fact that observations are expensive or that subjects eligible for the trial are hard to find and recruit. We might have to live with a large β, but good science seems to demand that α be small.

What is happening is that the formal Neyman-Pearson machinery is being used, but it is being given a rejection-trial interpretation (Emphasis added). The quantities α and β are not just the respective probabilities of choosing one hypothesis when the other is true; if they were, then calling the first hypothesis H2 and the second H1 would reverse the roles of α and β, and α = 0.20, β = 0.05 would be just as satisfactory for the problem in its new formulation as α = 0.05 and β = 0.20 were in the old one. The asymmetry arises because the quantity α is being used in the dual roles that it plays in rejection trials — it is both the probability of rejecting a hypothesis when that hypothesis is true and the measure of strength of the evidence needed to justify rejection. Good science demands small α because small α is supposed to mean strong evidence. On the other hand, the Type II error probability β is being interpreted simply as the probability of failing to find strong evidence against H1 when the alternative H2 is true (Emphasis added. Recall Fisher’s quote about the impossibility of making Type II errors since we never accept the null.) … When observations are expensive or difficult to obtain we might indeed have to live with a large probability of failure to find strong evidence. In fact, when the expense or difficulty is extreme, we often decide not to do the experiment at all, thereby accpeting values of α = 0 and β = [1].

— End excerpt.

So there we have our confusion, which I alluded to in the previous post. We are imposing rejection-trial reasoning onto the Neyman-Pearson decision framework. We accept a huge β because we interpret our results as a mere failure (to produce strong enough evidence) to reject the null, when really our results imply a decision to accept the ‘null’. Remember, with NP we are always forced to choose between two hypotheses — we can never abstain from this choice because the respective rejection regions for H1 and H2 encompass the entire sample space by definition; that is, any result obtained must fall into one of the rejection regions we’ve defined. We can adjust either α or β (before starting the experiment) as we see fit, based on the relative costs of these errors. Since neither hypothesis is inherently special, adjusting α is as justified as adjusting β and neither has any bearing on the strength of evidence from our experiment.

And surely it doesn’t matter which hypothesis is defined as the null, because then we would just switch the respective α and β — that is, H1 and H2 can be reversed without any penalty in the NP framework. Who cares which hypothesis gets the label 1 or 2?

But imagine the outrage (and snarky blog posts) if we tried swapping out the null hypothesis with our pet hypothesis in a rejection trial. Would anybody buy it if we tried to accept our pet hypothesis simply based on a failure to reject it? Of course not, because that would be absurd. Failing to find strong evidence against a single hypothesis has no logical implication that we have found evidence for that hypothesis. Fisher was right about this one. And this is yet another reason NP procedures and rejection trials don’t mix.

However, when we are using concepts of power and Type II errors, we are working with NP procedures which are completely symmetrical and have no concept of strength of evidence per se. Failure to reject the null hypothesis has the exact same meaning as accepting the null hypothesis — they are simply different ways to say the same thing.  If what you want is to measure evidence, fine; I think we should be measuring evidence in any case. But then you don’t have a relevant concept of power, as Fisher has reiterated time and time again. If you want to use power to help plan experiments (as seems to be recommended just about everywhere you look) then you must cast aside your intuitions about interpreting observations from that experiment as evidence. You must reject the rejection trial and reject notions of statistical evidence. 

Or don’t, but then you’re swimming in a sea of confusion.



Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC press.

Are all significance tests made of the same stuff?

No! If you are like most of the sane researchers out there, you don’t spend your days and nights worrying about the nuances of different statistical concepts. Especially ones as traditional as these. But there is one concept that I think we should all be aware of: P-values mean very different things to different people. Richard Royall (1997, p. 76-7) provides a smattering of different possible interpretations and fleshes out the arguments for why these mixed interpretations are problematic (much of this post comes from his book):

In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we will say that the data on which the test is based do not provide sufficient evidence to cause rejection. (Daniel, 1991, p. 192)

A nonsignificant result does not prove that the null hypothesis is correct — merely that it is tenable — our data do not give adequate grounds for rejecting it. (Snedecor and Cochran, 1980, p. 66)

The verdict does not depend on how much more readily some other hypothesis would explain the data. We do not even start to take that question seriously until we have rejected the null hypothesis. …..The statistical significance level is a statement about evidence… If it is small enough, say p = 0.001, we infer that the result is not readily explained as a chance outcome if the null hypothesis is true and we start to look for an alternative explanation with considerable assurance. (Murphy, 1985, p. 120)

If [the p-value] is small, we have two explanations — a rare event has happened, or the assumed distribution is wrong. This is the essence of the significance test argument. Not to reject the null hypothesis … means only that it is accepted for the moment on a provisional basis. (Watson, 1983)

Test of hypothesis. A procedure whereby the truth or falseness of the tested hypothesis is investigated by examining a value of the test statistic computed from a sample and then deciding to reject or accept the tested hypothesis according to whether the value falls into the critical region or acceptance region, respectively. (Remington and Schork, 1970, p. 200)

Although a ‘significant’ departure provides some degree of evidence against a null hypothesis, it is important to realize that a ‘nonsignificant’ departure does not provide positive evidence in favour of that hypothesis. The situation is rather that we have failed to find strong evidence against the null hypothesis. (Armitage and Berry, 1987, p. 96)

If that value [of the test statistic] is in the region of rejection, the decision is to reject H0; if that value is outside the region of rejection, the decision is that H0 cannot be rejected at the chosen level of significance … The reasoning behind this decision process is very simple. If the probability associated with the occurance under the null hypothesis of a particular value in the sampling distribution is very small, we may explain the actual occurrence of that value in two ways; first we may explain it by deciding that the null hypothesis is false or, second, we may explain it by deciding that a rare and unlikely event has occurred. (Siegel and Castellan, 1988, Chapter 2)

These all mix and match three distinct viewpoints with regard to hypothesis tests: 1) Neyman-Pearson decision procedures, 2) Fisher’s p-value significance tests, and 3) Fisher’s rejection trials (I think 2 and 3 are sufficiently different to be considered separately). Mixing and matching them is inappropriate, as will be shown below. Unfortunately, they all use the same terms so this can get confusing! I’ll do my best to keep things simple.

1. Neyman-Pearson (NP) decision procedure:
Neyman describes it thusly:

The problem of testing a statistical hypothesis occurs when circumstances force us to make a choice between two courses of action: either take step A or take step B… (Neyman 1950, p. 258)

…any rule R prescribing that we take action A when the sample point … falls within a specified category of points, and that we take action B in all other cases, is a test of a statistical hypothesis. (Neyman 1950, p. 258)

The terms ‘accepting’ and ‘rejecting’ a statistical hypothesis are very convenient and well established. It is important, however, to keep their exact meaning in mind and to discard various additional implications which may be suggested by intuition. Thus, to accept a hypothesis H means only to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true. Also if the application … ‘rejects’ H, this means only that the rule prescribes action B and does not imply that we believe that H is false. (Neyman 1950, p. 259)

So what do we take from this? NP testing is about making a decision to choose H0 or H1, not about shedding light on the truth of any one hypothesis or another. We calculate a test statistic, see where it lies with regard to our predefined rejection regions, and make the corresponding decision. We can assure that we are not often wrong by defining Type I and Type II error probabilities (α and β) to be used in our decision procedure. According to this framework, a good test is one that minimizes these long-run error probabilities. It is important to note that this procedure cannot tell us anything about the truth of hypotheses and does not provide us with a measure of evidence of any kind, only a decision to be made according to our criteria. This procedure is notably symmetric — that is, we can either choose H0 or H1.

Test results would look like this:

α and β were prespecified -based on relevant costs associated with the different errors- for this situation at yadda yadda yadda. The test statistic (say, t=2.5) falls inside the rejection region for H0 defined as t>2.0 so we reject H0 and accept H1.” (Alternatively, you might see “p < α = x so we reject H0. The exact value of p is irrelevant, it is either inside or outside of the rejection region defined by α. Obtaining a p = .04 is effectively equivalent to p = .001 for this procedure, as is obtaining a result very much larger than the critical t above.)

2. Fisher’s p-value significance tests 

Fisher’s first procedure is only ever concerned with one hypothesis- that being the null. This procedure is not concerned with making decisions (and when in science do we actually ever do that anyway?) but with measuring evidence against the hypothesis. We want to evaluate ‘the strength of evidence against the hypothesis’ (Fisher, 1958, p.80) by evaluating how rare our particular result (or even bigger results) would be if there were really no effect in the study. Our objective here is to calculate a single number that Fisher called the level of significance, or the p-value. Smaller p is more evidence against the hypothesis than larger p. Increasing levels of significance* are often represented** by more asterisks*** in tables or graphs. More asterisks mean lower p-values, and presumably more evidence against the null.

What is the rationale behind this test? There are only two possible interpretations of our low p: either a rare event has occurred, or the underlying hypothesis is false. Fisher doesn’t think the former is reasonable, so we should assume the latter (Bakan, 1966).

Note that this procedure is directly trying to measure the truth value of a hypothesis. Lower ps indicate more evidence against the hypothesis. This is based on the Law of Improbability, that is,

Law of Improbability: If hypothesis A implies that the probability that a random variable X takes on the value x is quite small, say p(x), then the observation X = x is evidence against A, and the smaller p(x), the stronger the evidence. (Royall, 1997, p. 65)

In a future post I will attempt to show why this law is not a valid indicator of evidence. For the purpose of this post we just need to understand the logic behind this test and that it is fundamentally different from NP procedures. This test alone does not provide any guidance with regard to taking action or making a decision, it is intended as a measure of evidence against a hypothesis.

Test results would look like this:

The present results obtain a t value of 2.5, which corresponds to an observed p = .01**. This level of significance is very small and indicates quite strong evidence against the hypothesis of no difference.

3. Fisher’s rejection trials

This is a strange twist on both of the other procedures above, taking elements from each to form a rejection trial. This test is a decision procedure, much like NP procedures, but with only one explicitly defined hypothesis, a la p-value significance tests. The test is most like what psychologists actually use today, framed as two possible decisions, again like NP, but now they are framed in terms of only one hypothesis. Rejection regions are back too, defined as a region of values that have small probability under H0 (i.e., defined by a small α). It is framed as a problem of logic, specifically,

…a process analogous to testing a proposition in formal logic via the argument known as modus tollens, or ‘denying the consequent’: if A implies B, then not-B implies not-A. We can test A by determining whether B is true. If B is false, then we conclude that A is false. But, on the other hand, if B is found to be true we cannot conclude that A is true. That is, A can be proven false by such a test but it cannot be proven true — either we disprove A or we fail to disprove it…. When B is found to be true, so that A survives the test, this result, although not proving A, does seem intuitively to be evidence supporting A. (Royall, 1997, p. 72)

An important caveat is that these tests are probabilistic in nature, so the logical implications aren’t quite right. Nevertheless, rejection trials are what Fisher referred to when he famously said,

Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis… The notion of an error of the so-called ‘second kind,’ due to accepting the null hypothesis ‘when it is false’ … has no meaning with reference to simple tests of significance. (Fisher, 1966)

So there is a major difference from NP — With rejection trials you have a single hypothesis (as opposed to 2) combined with decision rules of “reject the H0 or do not reject H0” (as opposed to reject H0/H1 or accept H0/H1). With rejection trials we are back to making a decision. This test is asymmetric (as opposed to NP which is symmetric) — that is, we can only ever reject H0, never accept it.

While we are making decisions with rejection trials, the decisions have a different meaning than that of NP procedures. In this framework, deciding to reject H0 implies the hypothesis is “inconsistent with the data” or that the data “provide sufficient evidence to cause rejection” of the hypothesis (Royall, 1997, p.74). So rejection trials are intended to be both decision procedures and measures of evidence. Test statistics that fall into smaller α regions are considered stronger evidence, much the same way that a smaller p-value indicates more evidence against the hypothesis. For NP procedures α is simply a property of the test, and choosing a lower one has no evidential meaning per se (although see Mayo, 1996 for a 4th significance procedure — severity testing).

Test results would look like this:

The present results obtain a t = 2.5, p = .01, which is sufficiently strong evidence against H0 to warrant its rejection.

What is the takeaway?

If you aren’t aware of the difference between the three types of hypothesis testing procedures, you’ll find yourself jumbling them all up (Gigerenzer, 2004). If you aren’t careful, you may end up thinking you have a measure of evidence when you actually have a guide to action.

Which one is correct?

Funny enough, I don’t endorse any of them. I contend that p-values never measure evidence (in either p-value procedures or rejection trials) and NP procedures lead to absurdities that I can’t in good faith accept while simultaneously endorsing them.

Why write 2000 words clarifying the nuanced differences between three procedures I think are patently worthless? Well, did you see what I said at the top referring to sane researchers?

A future post is coming that will explicate the criticisms of each procedure, many of the points again coming from Royall’s book.


Armitage, P., & Berry, G. (1987). Statistical methods in medical research. Oxford: Blackwell Scientific.

Bakan, D. (1966). The test of significance in psychological research.Psychological bulletin, 66(6), 423.

Daniel, W. W. (1991). Hypothesis testing. Biostatistics: a foundation for analysis in the health sciences5, 191.

Fisher, R. A. (1958).Statistical methods for research workers (13th ed.). New York: Hafner.

Fisher, R. A. (1966). The design of experiments (8th edn.) Oliver and Boyd.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics,33(5), 587-606.

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Murphy, E. A. (1985). A companion to medical statistics. Johns Hopkins University Press.

Neyman, J. (1950). First course in probability and statistic. Published by Henry Holt, 1950.,1.

Remington, R. D., & Schork, M. A. (1970). Statistics with applications to the biological and health sciences.

Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC press.

Siegel, S. C., & Castellan, J. NJ (1988). Nonparametric statistics for the behavioural sciences. New York, McGraw-Hill.

Snedecor, G. W. WG Cochran. 1980. Statistical Methods. Iowa State Univ. Press, Ames.

Watson, G. S. (1983). Hypothesis testing. Encyclopedia of Statistics in Quality and Reliability.

Can confidence intervals save psychology? Part 2

This is part 2 in a series about confidence intervals (here’s part 1). Answering the question in the title is not really my goal, but simply to discuss confidence intervals and their pros and cons. The last post explained why frequency statistics (and confidence intervals) can’t assign probabilities to one-time events, but always refer to a collective of long-run events.

If confidence intervals don’t really tell us what we want to know, does that mean we should throw them in the dumpster along with our p-values? No, for a simple reason: In the long-run we will make less errors with confidence intervals (CIs) than we will with p. Eventually we may want to drop CIs for more nuanced inference, but for the time being we would do much better with this simple switch.

If we calculate CIs for every (confirmatory) experiment we ever run, roughly 95% of our CIs will hit the mark (i.e., contain the true population mean). Can we ever know which ones? Tragically, no. But some would feel pretty good about the process being used if it only has a 5% life-time error rate. One could achieve a lower error rate by stretching the intervals (to say, 99%) but that would leave them too embarrassingly wide for most.

If we use p we will be wrong 5% of the time in the long-run when we are testing a true null-hypothesis (i.e., no association between variables, or no difference between means, etc., and assuming the analysis is 100% pre-planned). But when we are testing a false null-hypothesis then we will be wrong roughly 40-50% of the time or more in the long-run (Button et al., 2013; Cohen, 1962; Sedlmeier & Gigerenzer, 1989). If you are one of the many who do not believe a null-hypothesis can actually be true, then we are always in the latter scenario with that huge error rate. In many cases (i.e., studying smallish and noisy effects- like most of psychology) we would literally be better off by flipping a coin and declaring our result “significant” whenever it lands heads. 

There is a limitation to this benefit of CIs, and this limitation is self-imposed. We cannot escape the monstrous error rates associated with p if we report CIs but then interpret them as if they are significance tests (i.e., reject if null value falls inside the interval). Switching to confidence intervals will do nothing if we use them as a proxy for p. So the question then becomes: Do people actually interpret CIs simply as a null-hypothesis significance test? Yes, unfortunately they do (Coulson et al., 2010).


Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of abnormal and social psychology, 65(3), 145-153.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but don’t guarantee, better inference than statistical significance testing.Frontiers in psychology, 1, 26.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies?. Psychological Bulletin, 105(2), 309.