Understanding Bayes: Visualization of the Bayes Factor

In the first post of the Understanding Bayes series I said:

The likelihood is the workhorse of Bayesian inference. In order to understand Bayesian parameter estimation you need to understand the likelihood. In order to understand Bayesian model comparison (Bayes factors) you need to understand the likelihood and likelihood ratios.

I’ve shown in another post how the likelihood works as the updating factor for turning priors into posteriors for parameter estimation. In this post I’ll explain how using Bayes factors for model comparison can be conceptualized as a simple extension of likelihood ratios.

There’s that coin again

Imagine we’re in a similar situation as before: I’ve flipped a coin 100 times and it came up 60 heads and 40 tails. The likelihood function for binomial data in general is:

$\ P \big(X = x \big) \propto \ p^x \big(1-p \big)^{n-x}$

and for this particular result:

$\ P \big(X = 60 \big) \propto \ p^{60} \big(1-p \big)^{40}$

The corresponding likelihood curve is shown below, which displays the relative likelihood for all possible simple (point) hypotheses given this data. Any likelihood ratio can be calculated by simply taking the ratio of the different hypotheses’ heights on the curve.

In that previous post I compared the fair coin hypothesis (H0: P(H)=.5) vs one particular trick coin hypothesis (H1: P(H)=.75). For 60 heads out of 100 tosses, the likelihood ratio for these hypotheses is L(.5)/L(.75) = 29.9. This means the data are 29.9 times as probable under the fair coin hypothesis as under this particular trick coin hypothesis.

But often we don’t have theories precise enough to make point predictions about parameters, at least not in psychology. So it’s often helpful if we can assign a range of plausible values for parameters as dictated by our theories.

Enter the Bayes factor

Calculating a Bayes factor is a simple extension of this process. A Bayes factor is a weighted average likelihood ratio, where the weights are based on the prior distribution specified for the hypotheses. For this example I’ll keep the simple fair coin hypothesis as the null hypothesis — H0: P(H)=.5 — but now the alternative hypothesis will become a composite hypothesis — H1: P(θ). (footnote 1) The likelihood ratio is evaluated at each point of P(θ) and weighted by the relative plausibility we assign that value. Then once we’ve assigned weights to each ratio we just take the average to get the Bayes factor. Figuring out how the weights should be assigned (the prior) is the tricky part.

Imagine my composite hypothesis, P(θ), is a combination of 21 different point hypotheses, all evenly spaced out between 0 and 1 and all of these points are weighted equally (not a very realistic hypothesis!). So we end up with P(θ) = {0, .05, .10, .15, . . ., .9, .95, 1}. The likelihood ratio can be evaluated at every possible point hypothesis relative to H0, and we need to decide how to assign weights. This is easy for this P(θ); we assign zero weight for every likelihood ratio that is not associated with one of the point hypotheses contained in P(θ), and we assign weights of 1 to all likelihood ratios associated with the 21 points in P(θ).

This gif has the 21 point hypotheses of P(θ) represented as blue vertical lines (indicating where we put our weights of 1), and the turquoise tracking lines represent the likelihood ratio being calculated at every possible point relative to H0: P(H)=.5. (Remember, the likelihood ratio is the ratio of the heights on the curve.) This means we only care about the ratios given by the tracking lines when the dot attached to the moving arm aligns with the vertical P(θ) lines. [edit: this paragraph added as clarification]

The 21 likelihood ratios associated with P(θ) are:

{~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, .002, .08, 1, 4.5, 7.5, 4.4, .78, .03, ~0, ~0, ~0, ~0, ~0}

Since they are all weighted equally we simply average, and obtain BF = 18.3/21 = .87. In other words, the data (60 heads out of 100) are 1/.87 = 1.15 times more probable under the null hypothesis — H0: P(H)=.5 — than this particular composite hypothesis — H1: P(θ). Entirely uninformative! Despite tossing the coin 100 times we have extremely weak evidence that is hardly worth even acknowledging. This happened because much of P(θ) falls in areas of extremely low likelihood relative to H0, as evidenced by those 13 zeros above. P(θ) is flexible, since it covers the entire possible range of θ, but this flexibility comes at a price. You have to pay for all of those zeros with a lower weighted average and a smaller Bayes factor.
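The averaging step can be sketched in a few lines of Python. This is only an illustration, not the code behind the gif; the names `likelihood`, `grid`, `ratios`, and `bf_wide` are my own, and the likelihood is written without the binomial constant because that constant cancels in every ratio.

```python
# Likelihood of 60 heads in 100 flips, up to a constant that cancels in ratios.
def likelihood(theta, heads=60, flips=100):
    return theta**heads * (1 - theta)**(flips - heads)

# The 21 equally weighted point hypotheses: 0, .05, .10, ..., 1
grid = [i / 20 for i in range(21)]

# Likelihood ratio of each point hypothesis against H0: P(H) = .5
ratios = [likelihood(theta) / likelihood(0.5) for theta in grid]

# Equal weights, so the Bayes factor is just the plain average of the ratios.
bf_wide = sum(ratios) / len(ratios)
print(round(bf_wide, 2))
```

The printed value should land near the .87 reported above, with most of the grid points contributing essentially nothing to the sum.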

Now imagine I had seen a trick coin like this before, and I know it had a slight bias towards landing heads. I can use this information to make more pointed predictions. Let’s say I define P(θ) as 21 equally weighted point hypotheses again, but this time they are all equally spaced between .5 and .75, which happens to be the highest density region of the likelihood curve (how fortuitous!). Now P(θ) = {.50, .5125, .525, . . ., .7375, .75}.

The 21 likelihood ratios associated with the new P(θ) are:

{1.00, 1.5, 2.1, 2.8, 4.5, 5.4, 6.2, 6.9, 7.5, 7.3, 6.9, 6.2, 4.4, 3.4, 2.6, 1.8, .78, .47, .27, .14, .03}

They are all still weighted equally, so the simple average is BF = 72/21 = 3.4. Three times more informative than before, and in favor of P(θ) this time! And no zeros. We were able to add theoretically relevant information to H1 to make more accurate predictions, and we get rewarded with a Bayes boost. (But this result is only 3-to-1 evidence, which is still fairly weak.)

This new P(θ) is risky though, because if the data show a bias towards tails or a more extreme bias towards heads then it faces a very heavy penalty (many more zeros). High risk = high reward with the Bayes factor. Make pointed predictions that match the data and get a bump to your BF, but if you’re wrong then pay a steep price. For example, if the data were 60 tails instead of 60 heads the BF would be 10-to-1 against P(θ) rather than 3-to-1 for P(θ)!
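To see the risk and the reward side by side, here is a small sketch that averages the likelihood ratios over the narrow grid for both the observed data and the mirror-image data. The helper `grid_bayes_factor` and the other names are hypothetical, introduced only for this illustration.

```python
def likelihood(theta, heads, flips):
    # Binomial likelihood up to a constant that cancels in ratios.
    return theta**heads * (1 - theta)**(flips - heads)

def grid_bayes_factor(grid, heads, flips, null=0.5):
    """Average likelihood ratio against the null over equally weighted points."""
    ratios = [likelihood(t, heads, flips) / likelihood(null, heads, flips)
              for t in grid]
    return sum(ratios) / len(ratios)

# 21 equally spaced point hypotheses between .5 and .75
narrow = [0.5 + 0.0125 * i for i in range(21)]

bf_heads = grid_bayes_factor(narrow, heads=60, flips=100)  # data as observed
bf_tails = grid_bayes_factor(narrow, heads=40, flips=100)  # 60 tails instead

print(round(bf_heads, 1), round(1 / bf_tails, 1))
```

With these assumptions the first number should come out close to the 3.4 in favor of P(θ), and the second (the odds against P(θ) when the data go the other way) should be on the order of the 10-to-1 mentioned above.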

Now, typically people don’t actually specify hypotheses like these. Typically they use continuous distributions, but the idea is the same. Take the likelihood ratio at each point relative to H0, weigh according to plausibilities given in P(θ), and then average.

A more realistic (?) example

Imagine you’re walking down the sidewalk and you see a shiny piece of foreign currency by your feet. You pick it up and want to know if it’s a fair coin or an unfair coin. As a Bayesian you have to be precise about what you mean by fair and unfair. Fair is typically pretty straightforward — H0: P(H)=.5 as before — but unfair could mean anything. Since this is a completely foreign coin to you, you may want to be fairly open-minded about it. After careful deliberation, you assign P(θ) a beta distribution, with shape parameters 10 and 10. That is, H1: P(θ) ~ Beta(10, 10). This means that if the coin isn’t fair, it’s probably close to fair but it could reasonably be moderately biased, and you have no reason to think it is particularly biased to one side or the other.

Now you build a perfect coin-tosser machine and set it to toss 100 times (but not any more than that because you haven’t got all day). You carefully record the results and the coin comes up 33 heads out of 100 tosses. Under which hypothesis are these data more probable, H0 or H1? In other words, which hypothesis did the better job predicting these data?

This may be a continuous prior but the concept is exactly the same as before: weigh the various likelihood ratios based on the prior plausibility assignment and then average. The continuous distribution on P(θ) can be thought of as a set of many many point hypotheses spaced very very close together. So if the range of θ we are interested in is limited to 0 to 1, as with binomials and coin flips, then a distribution containing 101 point hypotheses spaced .01 apart can effectively be treated as if it were continuous. The numbers will be a little off but all in all it’s usually pretty close. So imagine that instead of 21 hypotheses you have 101, and their relative plausibilities follow the shape of a Beta(10, 10). (footnote 2)

Since this is not a uniform distribution, we need to assign varying weights to each likelihood ratio. Each likelihood ratio associated with a point in P(θ) is simply multiplied by the respective density assigned to it under P(θ). For example, the density of P(θ) at .4 is 2.44. So we multiply the likelihood ratio at that point, L(.4)/L(.5) = 128, by 2.44, and add it to the accumulating total likelihood ratio. Do this for every point and then divide by the total number of points, in this case 101, to obtain the approximate Bayes factor. The total weighted likelihood ratio is 5564.9, divide it by 101 to get 55.1, and there’s the Bayes factor. In other words, the data are roughly 55 times more probable under this composite H1 than under H0. The alternative hypothesis H1 did a much better job predicting these data than did the null hypothesis H0.

The actual Bayes factor is obtained by integrating the likelihood with respect to H1’s density distribution and then dividing by the (marginal) likelihood of H0. Essentially what it does is cut P(θ) into infinitely thin slices before it calculates the likelihood ratios, weights them, and averages. That Bayes factor comes out to 55.7, which is basically the same thing we got through this rough-and-ready visualization demonstration!
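Both numbers can be reproduced with a short sketch, under the assumption that the grid is the 101 points 0, .01, ..., 1 described earlier. The function names are mine, and the closed form for the exact Bayes factor relies on the standard beta-binomial result (a ratio of beta functions), which is a swapped-in shortcut rather than anything shown in this post.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    # log of the beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_pdf(theta, a, b):
    if theta <= 0.0 or theta >= 1.0:
        return 0.0  # for shape parameters above 1 the density is 0 at the endpoints
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta(a, b))

def likelihood_ratio(theta, heads, flips, null=0.5):
    # L(theta) / L(null); the binomial constant cancels
    return (theta / null)**heads * ((1 - theta) / (1 - null))**(flips - heads)

heads, flips, a, b = 33, 100, 10, 10
grid = [i / 100 for i in range(101)]

# Weight each likelihood ratio by the prior density at that point, then sum.
total = sum(beta_pdf(t, a, b) * likelihood_ratio(t, heads, flips) for t in grid)
bf_grid = total / len(grid)  # divide by the 101 points, as in the text

# Exact Bayes factor: B(a + heads, b + tails) / B(a, b), divided by .5^flips
log_bf_exact = (log_beta(a + heads, b + flips - heads) - log_beta(a, b)
                - flips * log(0.5))
bf_exact = exp(log_bf_exact)

print(round(bf_grid, 1), round(bf_exact, 1))
```

One small observation from this sketch: multiplying the total by the grid spacing (.01, i.e. dividing by 100 rather than 101) is the usual Riemann-sum convention and lands even closer to the exact value, which probably accounts for most of the gap between 55.1 and 55.7.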

Take home

The take-home message is hopefully pretty clear at this point: When you are comparing a point null hypothesis with a composite hypothesis, the Bayes factor can be thought of as a weighted average of every point hypothesis’s likelihood ratio against H0, and the weights are determined by the prior density distribution of H1. Since the Bayes factor is a weighted average based on the prior distribution, it’s really important to think hard about the prior distribution you choose for H1. In a previous post, I showed how different priors can converge to the same posterior with enough data. The priors are often said to “wash out” in estimation problems like that. This is not necessarily the case for Bayes factors. The priors you choose matter, so think hard!

Notes

Footnote 1: A lot of ink has been spilled arguing about how one should define P(θ). I talked about it a little in a previous post.

Footnote 2: I’ve rescaled the likelihood curve to match the scale of the prior density under H1. This doesn’t affect the values of the Bayes factor or likelihood ratios because the scaling constant cancels itself out.

Type-S and Type-M errors

An anonymous reader of the blog emailed me:

I wonder if you’d be ok to help me to understanding this Gelman’s [graph]. I struggle to understand what is the plotted distribution and the exact meaning of the red area. Of course I read the related article, but it doesn’t help me much.

Rather than write a long-winded email, I figured it would be easier to explain on the blog using some step by step illustrations. With the anonymous reader’s permission I am sharing the question and this explanation for all to read. The graph in question is reproduced below. I will walk through my explanation by building up to this plot piecewise with the information we have about the specific situation referenced in the related paper. The paper, written by Andrew Gelman and John Carlin, illustrates the concepts of Type-M errors and Type-S errors. From the paper:
We frame our calculations not in terms of Type 1 and Type 2 errors but rather Type S (sign) and Type M (magnitude) errors, which relate to the probability that claims with confidence have the wrong sign or are far in magnitude from underlying effect sizes (p. 2)
So Gelman’s graph is an attempt to illustrate these types of errors. I won’t go into the details of the paper since you can read it yourself! I was asked to explain this graph though, which isn’t in the paper, so we’ll go through step by step building our own type-s/m graph in order to build an understanding. The key idea is this: if the underlying true population mean is small and sampling error is large, then experiments that achieve statistical significance must have exaggerated effect sizes and are likely to have the wrong sign. The graph in question:
A few technical details: Here Gelman is plotting a sampling distribution for a hypothetical experiment. If one were to repeatedly take a sample from a population, then each sample mean would be different from the true population mean by some amount due to random variation. When we run an experiment, we essentially pick a sample mean from this distribution at random. Picking at random, sample means tend to be near the true mean of the population, and how much these random sample means vary follows a curve like this. The height of the curve represents the relative frequency for a sample mean in a series of random picks. Obtaining sample means far away from the true mean is relatively rare since the height of the curve is much lower the farther out we go from the population mean. The red shaded areas indicate values of sample means that achieve statistical significance (i.e., exceed some critical value).
The distribution’s form is determined by two parameters: a location parameter and a scale parameter. The location parameter is simply the mean of the distribution (μ), and the scale parameter is the standard deviation of the distribution (σ). In this graph, Gelman defines the true population mean to be 2 based on his experience in this research area; the standard deviation is equal to the sampling error (standard error) of our procedure, which in this case is approximately 8.1 (estimated from empirical data; for more information see the paper, p. 6). The extent of variation in sample means is determined by the amount of sampling error present in our experiment. If measurements are noisy, or if the sample is small, or both, then sampling error goes up. This is reflected in a wider sampling distribution. If we can refine our measurements, or increase our sample size, then sampling error goes down and we see a narrower sampling distribution (smaller value of σ).

Let’s build our own Type-S and Type-M graph

In Gelman’s graph the mean of the population is 2, and this is indicated by the vertical blue line at the peak of the curve. Again, this hypothetical true value is determined by Gelman’s experience with the topic area. The null hypothesis states that the true mean of the population is zero, and this is indicated by the red vertical line. The hypothetical sample mean from Gelman’s paper is 17, which I’ve added as a small grey diamond near the x-axis. R code to make all figures is provided at the end of this post (except the gif).
If we assume that the true population mean is actually zero (indicated by the red vertical line), instead of 2, then the sampling distribution has a location parameter of 0 and a scale parameter of 8.1. This distribution is shown below. The diamond representing our sample mean corresponds to a fairly low height on the curve, indicating that it is relatively rare to obtain such a result under this sampling distribution.
Next we need to define cutoffs for statistically significant effects (the red shaded areas under the curve in Gelman’s plot) using the null value combined with the sampling error of our procedure. Since this is a two-sided test using an alpha of 5%, we have one cutoff for significance at approximately -15.9 (i.e., 0 – [1.96 x 8.1]) and the other cutoff at approximately 15.9 (i.e., 0 + [1.96 x 8.1]). Under the null sampling distribution, the shaded areas are symmetrical. If we obtain a sample mean that lies beyond these cutoffs we declare our result statistically significant by conventional standards. As you can see, the diamond representing our sample mean of 17 is just beyond this cutoff and thus achieves statistical significance.
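The cutoff arithmetic can be checked with Python's standard library. This is a minimal sketch assuming a normal sampling distribution; the variable names are mine, and it is not the R code referred to in the post.

```python
from statistics import NormalDist

standard_error = 8.1  # sampling error quoted from the paper
null_mean = 0.0       # value of the population mean under the null
sample_mean = 17.0    # hypothetical sample mean (the grey diamond)

# Critical z value for a two-sided test at alpha = 5% (about 1.96)
z_crit = NormalDist().inv_cdf(0.975)

upper = null_mean + z_crit * standard_error
lower = null_mean - z_crit * standard_error

# A result is "significant" if it falls beyond either cutoff.
is_significant = sample_mean > upper or sample_mean < lower
print(round(lower, 1), round(upper, 1), is_significant)
```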
But Gelman’s graph assumes the population mean is actually 2, not zero. This is important because we can’t actually have a sign error or a magnitude error if there isn’t a true sign or magnitude. We can adjust the curve so that the peak is above 2 by shifting it over slightly to the right. The shaded areas begin in the same place on the x-axis as before (+/- 15.9), but notice that they have become asymmetrical. This is due to the fact that we shifted the entire distribution slightly to the right, shrinking the left shaded area and expanding the right shaded area.
And there we have our own beautiful type-s and type-m graph. Since the true population mean is small and positive, any sample mean falling in the left tail has the wrong sign and vastly overestimates the population mean (-15.9 vs. 2). Any sample mean falling in the right tail has the correct sign, but again vastly overestimates the population mean (15.9 vs. 2). Our sample mean falls squarely in the right shaded tail. Since the standard error of this procedure (8.1) is much larger than the true population mean (2), any statistically significant result must have a sample mean that is much larger in magnitude than the true population mean, and is quite likely to have the wrong sign.
In this case the left tail contains 24% of the total shaded area under the curve, so in repeated sampling a full 24% of significant results will be in the wrong tail (and thus be a sign error). If the true population mean were still positive but larger in magnitude then the shaded area in the left tail would become smaller and smaller, as it did when we shifted the true population mean from zero to 2, and thus sign errors would be less of a problem. As Gelman and Carlin summarize,
setting the true effect size to 2% and the standard error of measurement to 8.1%, the power comes out to 0.06, the Type S error probability is 24%, and the expected exaggeration factor is 9.7. Thus, it is quite likely that a study designed in this way would lead to an estimate that is in the wrong direction, and if “significant,” it is likely to be a huge overestimate of the pattern in the population. (p. 6)
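The quantities in that summary can be approximated directly from the shifted sampling distribution. The sketch below assumes a normal sampling distribution and uses the closed-form mean of a truncated normal for the exaggeration factor; that formula is my own choice for the illustration, not something taken from the paper, and all variable names are hypothetical.

```python
from statistics import NormalDist

true_mean, standard_error = 2.0, 8.1
sampling = NormalDist(mu=true_mean, sigma=standard_error)

# Significance cutoffs are still set by the null value of zero.
cutoff = NormalDist().inv_cdf(0.975) * standard_error

p_right = 1 - sampling.cdf(cutoff)  # significant, correct sign
p_left = sampling.cdf(-cutoff)      # significant, wrong sign
power = p_left + p_right
type_s = p_left / power             # share of significant results with wrong sign

# Expected |sample mean| among significant results, using
#   E[X; X > c]  = mu * P(X > c)  + sigma^2 * f(c)
#   E[-X; X < -c] = -mu * P(X < -c) + sigma^2 * f(-c)
# where f is the density of the sampling distribution.
variance = standard_error**2
mean_abs_significant = (
    true_mean * p_right + variance * sampling.pdf(cutoff)
    - true_mean * p_left + variance * sampling.pdf(-cutoff)
) / power
exaggeration = mean_abs_significant / true_mean

print(round(power, 2), round(type_s, 2), round(exaggeration, 1))
```

Under these assumptions the power and Type S values should match the 0.06 and 24% quoted above, and the exaggeration factor should land close to, though not exactly on, the quoted 9.7 (small differences are to be expected if that figure comes from simulation).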
I hope I’ve explained this clearly enough for you, anonymous reader (and other readers, of course). Leave a comment below or tweet/email me if anything is unclear!
Here is a neat gif showing our progression! Thanks for reading 🙂

(I don’t think this disclaimer is needed but here it goes: I don’t think people should actually use repeated-sampling statistical inference. This is simply an explanation of the concept. Be a Bayesian!)

Understanding Bayes: A Look at the Likelihood

[This post has been updated and turned into a paper to be published in AMPPS]

Much of the discussion in psychology surrounding Bayesian inference focuses on priors. Should we embrace priors, or should we be skeptical? When are Bayesian methods sensitive to specification of the prior, and when do the data effectively overwhelm it? Should we use context specific prior distributions or should we use general defaults? These are all great questions and great discussions to be having.

One thing that often gets left out of the discussion is the importance of the likelihood. The likelihood is the workhorse of Bayesian inference. In order to understand Bayesian parameter estimation you need to understand the likelihood. In order to understand Bayesian model comparison (Bayes factors) you need to understand the likelihood and likelihood ratios.

What is likelihood?

Likelihood is a funny concept. It’s not a probability, but it is proportional to a probability. The likelihood of a hypothesis (H) given some data (D) is the probability of obtaining D given that H is true, multiplied by an arbitrary positive constant (K). In other words, L(H|D) = K · P(D|H). Since a likelihood isn’t actually a probability it doesn’t obey various rules of probability. For example, likelihoods need not sum to 1.

A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, L(H|D), conditions on the data as if they are fixed while allowing the hypotheses to vary.

The distinction is subtle, so I’ll say it again. For conditional probability, the hypothesis is treated as a given and the data are free to vary. For likelihood, the data are a given and the hypotheses vary.

The Likelihood Axiom

Edwards (1992, p. 30) defines the Likelihood Axiom as a natural combination of the Law of Likelihood and the Likelihood Principle.

The Law of Likelihood states that “within the framework of a statistical model, a particular set of data supports one statistical hypothesis better than another if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis” (Emphasis original. Edwards, 1992, p. 30).

In other words, there is evidence for H1 vis-a-vis H2 if and only if the probability of the data under H1 is greater than the probability of the data under H2. That is, D is evidence for H1 over H2 if P(D|H1) > P(D|H2). If these two probabilities are equivalent, then there is no evidence for either hypothesis over the other. Furthermore, the strength of the statistical evidence for H1 over H2 is quantified by the ratio of their likelihoods, L(H1|D)/L(H2|D) (which again is proportional to P(D|H1)/P(D|H2) up to an arbitrary constant that cancels out).

The Likelihood Principle states that the likelihood function contains all of the information relevant to the evaluation of statistical evidence. Other facets of the data that do not factor into the likelihood function are irrelevant to the evaluation of the strength of the statistical evidence (Edwards, 1992, p. 30; Royall, 1997, p. 22). They can be meaningful for planning studies or for decision analysis, but they are separate from the strength of the statistical evidence.

Likelihoods are meaningless in isolation

Unlike a probability, a likelihood has no real meaning per se due to the arbitrary constant. Only by comparing likelihoods do they become interpretable, because the constant in each likelihood cancels the other one out. The easiest way to explain this aspect of likelihood is to use the binomial distribution as an example.

Suppose I flip a coin 10 times and it comes up 6 heads and 4 tails. If the coin were fair, p(heads) = .5, the probability of this occurrence is defined by the binomial distribution:

$\ P \big(X = x \big) = \binom{n}{x} p^x \big(1-p \big)^{n-x}$

where x is the number of heads obtained, n is the total number of flips, p is the probability of heads, and

$\binom{n}{x} = \frac{n!}{x! (n-x)!}$

Substituting in our values we get

$\ P \big(X = 6 \big) = \frac{10!}{6! (4!)} \big(.5 \big)^6 \big(1-.5 \big)^{4} \approx .21$

If the coin were a trick coin, so that p(heads) = .75, the probability of 6 heads in 10 tosses is:

$\ P \big(X = 6 \big) = \frac{10!}{6! (4!)} \big(.75 \big)^6 \big(1-.75 \big)^{4} \approx .15$

To quantify the statistical evidence for the first hypothesis against the second, we simply divide one probability by the other. This ratio tells us everything we need to know about the support the data lend to one hypothesis vis-a-vis the other. In the case of 6 heads in 10 tosses, the likelihood ratio (LR) for a fair coin vs our trick coin is:

$LR = \Bigg(\frac{10!}{6! (4!)} \big(.5 \big)^6 \big(1-.5 \big)^4 \Bigg) \div \Bigg(\frac{10!}{6! (4!)} \big(.75 \big)^6 \big(1-.75 \big)^4 \Bigg) \approx .21/.15 \approx 1.4$

Translation: The data are 1.4 times as probable under a fair coin hypothesis as under this particular trick coin hypothesis. Notice how the first terms in each of the equations above, i.e., $\frac{10!}{6! (4!)}$, are equivalent and completely cancel each other out in the likelihood ratio.
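For readers who would like to check the arithmetic, here is a minimal Python sketch of the same calculation. The function name `binomial_probability` is my own, and `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binomial_probability(heads, flips, p):
    # P(X = heads) for a binomial with `flips` trials and success probability p
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

p_fair = binomial_probability(6, 10, 0.5)    # fair coin
p_trick = binomial_probability(6, 10, 0.75)  # trick coin
lr = p_fair / p_trick

print(round(p_fair, 2), round(p_trick, 2), round(lr, 1))
```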

Same data. Same constant. Cancel out.

The first term in the equations above, $\frac{10!}{6! (4!)}$, details our journey to obtaining 6 heads out of 10. If we change our journey (i.e., different sampling plan) then this changes the term’s value, but crucially, since it is the same term in both the numerator and denominator it always cancels itself out. In other words, the information contained in the way the data are obtained disappears from the function. Hence the irrelevance of the stopping rule to the evaluation of statistical evidence, which is something that makes Bayesian and likelihood methods valuable and flexible.

If we leave out the first term in the above calculations, our numerator is L(.5) = 0.0009765625 and our denominator is L(.75) ≈ 0.0006952286. Using these values to form the likelihood ratio we get: 0.0009765625/0.0006952286 ≈ 1.4, as we should since the other terms simply cancelled out before.
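The cancellation can be illustrated by swapping in different constants. In the sketch below, the "stop at the 4th tail" plan is a hypothetical alternative sampling plan that I am assuming for illustration (its constant is the negative binomial coefficient, since the last flip must be a tail); it is not an example from this post.

```python
from math import comb

heads, tails = 6, 4

def kernel(p):
    # The part of the likelihood that depends on p
    return p**heads * (1 - p)**tails

constants = {
    "fixed at 10 flips": comb(10, 6),  # binomial coefficient, 210
    "stop at 4th tail": comb(9, 3),    # last flip is a tail; place 3 tails in 9 flips
    "constant left out": 1,
}

# The same constant multiplies numerator and denominator, so it cancels.
ratios = {plan: (k * kernel(0.5)) / (k * kernel(0.75))
          for plan, k in constants.items()}

for plan, ratio in ratios.items():
    print(plan, round(ratio, 4))
```

Whatever constant the sampling plan contributes, the likelihood ratio for the fair coin vs the trick coin comes out the same.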

Again I want to reiterate that the value of a single likelihood is meaningless in isolation; only in comparing likelihoods do we find meaning.

Looking at likelihoods

Likelihoods may seem overly restrictive at first. We can only compare 2 simple statistical hypotheses in a single likelihood ratio. But what if we are interested in comparing many more hypotheses at once? What if we want to compare all possible hypotheses at once?

In that case we can plot the likelihood function for our data, and this lets us ‘see’ the evidence in its entirety. By plotting the entire likelihood function we compare all possible hypotheses simultaneously. The Likelihood Principle tells us that the likelihood function encompasses all statistical evidence that our data can provide, so we should always plot this function alongside our reported likelihood ratios.

Following the wisdom of Birnbaum (1962), “the ‘evidential meaning’ of experimental results is characterized fully by the likelihood function” (as cited in Royall, 1997, p. 25). So let’s look at some examples. The R script at the end of this post can be used to reproduce these plots, or you can use it to make your own plots. Play around with it and see how the functions change for different numbers of heads, total flips, and hypotheses of interest. See the instructions in the script for details.

Below is the likelihood function for 6 heads in 10 tosses. I’ve marked our two hypotheses from before on the likelihood curve with blue dots. Since the likelihood function is meaningful only up to an arbitrary constant, the graph is scaled by convention so that the best supported value (i.e., the maximum) corresponds to a likelihood of 1.

The vertical dotted line marks the hypothesis best supported by the data. The likelihood ratio of any two hypotheses is simply the ratio of their heights on this curve. We can see from the plot that the fair coin has a higher likelihood than our trick coin.
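The scaling convention is easy to mimic numerically. This is a rough Python sketch (not the R script mentioned above) that evaluates the likelihood on a grid of 101 values, rescales it so the maximum equals 1, and reads off the heights of the two hypotheses; the grid resolution and all names are my own assumptions.

```python
def likelihood(p, heads=6, flips=10):
    # Binomial likelihood up to a constant
    return p**heads * (1 - p)**(flips - heads)

grid = [i / 100 for i in range(101)]  # candidate values of p: 0, .01, ..., 1
raw = [likelihood(p) for p in grid]

# Rescale so the best supported value has a height of exactly 1.
peak = max(raw)
scaled = [value / peak for value in raw]

best = grid[scaled.index(1.0)]  # location of the maximum
height_fair = scaled[50]        # p = .5
height_trick = scaled[75]       # p = .75

print(best, round(height_fair, 2), round(height_trick, 2),
      round(height_fair / height_trick, 1))
```

Because rescaling divides every height by the same number, the ratio of the two heights is unchanged by it.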

How does the curve change if instead of 6 heads out of 10 tosses, we tossed 100 times and obtained 60 heads?

Our curve gets much narrower! How did the strength of evidence change for the fair coin vs the trick coin? The new likelihood ratio is L(.5)/L(.75) ≈ 29.9. Much stronger evidence! (footnote) However, due to the narrowing, neither of these hypothesized values is very high up on the curve anymore. It might be more informative to compare each of our hypotheses against the best supported hypothesis. This gives us two likelihood ratios: L(.6)/L(.5) ≈ 7.5 and L(.6)/L(.75) ≈ 224.

Here is one more curve, for when we obtain 300 heads in 500 coin flips.

Notice that both of our hypotheses look to be very near the minimum of the graph. Yet their likelihood ratio is much stronger than before. For this data the likelihood ratio L(.5)/L(.75) is nearly 24 million! The inherent relativity of evidence is made clear here: The fair coin was supported when compared to one particular trick coin. But this should not be interpreted as absolute evidence for the fair coin, because the likelihood ratio for the maximally supported hypothesis vs the fair coin, L(.6)/L(.5), is nearly 24 thousand!

We need to be careful not to make blanket statements about absolute support, such as claiming that the maximum is “strongly supported by the data”. Always ask, “Compared to what?” The best supported hypothesis will only be weakly supported vs any hypothesis just before or just after it on the x-axis. For example, L(.6)/L(.61) ≈ 1.1, which is barely any support one way or the other. It cannot be said enough that evidence for a hypothesis must be evaluated in consideration with a specific alternative.
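The ratios quoted in this section can be recomputed with one small helper. The sketch below works on the log scale, which is my own choice to keep very small likelihoods numerically well behaved; the function and variable names are hypothetical.

```python
from math import exp, log

def likelihood_ratio(p1, p2, heads, flips):
    """L(p1) / L(p2) for binomial data, computed on the log scale."""
    tails = flips - heads
    log_lr = (heads * (log(p1) - log(p2))
              + tails * (log(1 - p1) - log(1 - p2)))
    return exp(log_lr)

# 60 heads in 100 flips
fair_vs_trick_100 = likelihood_ratio(0.5, 0.75, 60, 100)
best_vs_fair_100 = likelihood_ratio(0.6, 0.5, 60, 100)
best_vs_trick_100 = likelihood_ratio(0.6, 0.75, 60, 100)

# 300 heads in 500 flips
fair_vs_trick_500 = likelihood_ratio(0.5, 0.75, 300, 500)
best_vs_fair_500 = likelihood_ratio(0.6, 0.5, 300, 500)
best_vs_neighbor_500 = likelihood_ratio(0.6, 0.61, 300, 500)

print(round(fair_vs_trick_100, 1), round(best_vs_fair_100, 1),
      round(best_vs_trick_100))
print(round(fair_vs_trick_500), round(best_vs_fair_500),
      round(best_vs_neighbor_500, 1))
```

The same function gives wildly different answers depending on which alternative is plugged in, which is the "compared to what?" point in numerical form.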

Connecting likelihood ratios to Bayes factors

Bayes factors are simple extensions of likelihood ratios. A Bayes factor is a weighted average likelihood ratio based on the prior distribution specified for the hypotheses. (When the hypotheses are simple point hypotheses, the Bayes factor is equivalent to the likelihood ratio.) The likelihood ratio is evaluated at each point of the prior distribution and weighted by the probability we assign that value. If the prior distribution assigns the majority of its probability to values far away from the observed data, then the average likelihood for that hypothesis is lower than one that assigns probability closer to the observed data. In other words, you get a Bayes boost if you make more accurate predictions. Bayes factors are extremely valuable, and in a future post I will tackle the hard problem of assigning priors and evaluating weighted likelihoods.

I hope you come away from this post with a greater knowledge of, and appreciation for, likelihoods. Play around with the R code and you can get a feel for how the likelihood functions change for different data and different hypotheses of interest.

(footnote) Obtaining 60 heads in 100 tosses is equivalent to obtaining 6 heads in 10 tosses 10 separate times. To obtain this new likelihood ratio we can simply multiply our ratios together. That is, raise the first ratio to the power of 10; 1.4^10 ≈ 28.9, which is just slightly off from the correct value of 29.9 due to rounding.
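As a worked check on the footnote’s arithmetic (a sketch; the unrounded ratio of about 1.4047 is my own calculation rather than a figure from the post), the likelihood ratio for 60 heads in 100 tosses factors into ten copies of the ratio for 6 heads in 10 tosses:

$\frac{L(.5)}{L(.75)} = \frac{.5^{60} \big(1-.5 \big)^{40}}{.75^{60} \big(1-.75 \big)^{40}} = \Bigg(\frac{.5^{6} \big(1-.5 \big)^{4}}{.75^{6} \big(1-.75 \big)^{4}} \Bigg)^{10} \approx 1.4047^{10} \approx 29.9$

Carrying the extra decimal places through the exponent recovers 29.9, so the gap from 28.9 is indeed down to rounding the ratio to 1.4 first.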

References

Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57(298), 269-306.

Edwards, A. W. (1992). Likelihood, expanded ed. Johns Hopkins University Press.

Royall, R. (1997). Statistical evidence: A likelihood paradigm (Vol. 71). CRC press.

Edwards, Lindman, and Savage (1963) on why the p-value is still so dominant

Below is an excerpt from Edwards, Lindman, and Savage (1963, pp. 236-7), on why p-value procedures continue to be dominant in the empirical sciences even after the p-value has been repeatedly shown to be an incoherent and nonsensical statistic (note: that is my choice of words; the authors are very cordial in their commentary). The age of the article shows in numbers 1 and 2, but I think it is still valuable commentary; numbers 3 and 4 are still highly relevant today.

From Edwards, Lindman, and Savage (1963, pp. 236-7):

If classical significance tests have rather frequently rejected true null hypotheses without real evidence, why have they survived so long and so dominated certain empirical sciences? Four remarks seem to shed some light on this important and difficult question.

1. In principle, many of the rejections at the .05 level are based on values of the test statistic far beyond the borderline, and so correspond to almost unequivocal evidence [i.e., passing the interocular trauma test]. In practice, this argument loses much of its force. It has become customary to reject a null hypothesis at the highest significance level among the magic values, .05, .01, and .001, which the test statistic permits, rather than to choose a significance level in advance and reject all hypotheses whose test statistics fall beyond the criterion value specified by the chosen significance level. So a .05 level rejection today usually means that the test statistic was significant at the .05 level but not at the .01 level. Still, a test statistic which falls just short of the .01 level may correspond to much stronger evidence against a null hypothesis than one barely significant at the .05 level. …

2. Important rejections at the .05 or .01 levels based on test statistics which would not have been significant at higher levels are not common. Psychologists tend to run relatively large experiments, and to get very highly significant main effects. The place where .05 level rejections are most common is in testing interactions in analyses of variance—and few experimenters take those tests very seriously, unless several lines of evidence point to the same conclusions. [emphasis added]

3. Attempts to replicate a result are rather rare, so few null hypothesis rejections are subjected to an empirical check. When such a check is performed and fails, explanation of the anomaly almost always centers on experimental design, minor variations in technique, and so forth, rather than on the meaning of the statistical procedures used in the original study.

4. Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data […] Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. […] The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses. [emphasis added]

References

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological review, 70(3), 193-242.

Question: Why do we settle for 80% power? Answer: We’re confused.

Coming back to the topic of my previous post, about how we must draw distinct conclusions from different hypothesis test procedures, I’d like to show an example of how these confusions might actually arise in practice. The following example comes from Royall’s book (you really should read it), and questions why we settle for a power of only 80%. It’s a question we’ve probably all thought about at some point. Isn’t 80% power just as arbitrary as p-value thresholds? And why should we settle for such a large probability of error before we even start an experiment?

From Royall (1997, pp. 109-110):

Why is a power of only 0.80 OK?

We begin with a mild peculiarity — why is it that the Type I error rate α is ordinarily required to be 0.05 or 0.01, but a Type II error rate as large as 0.20 is regularly adopted? This often occurs when the sample size for a clinical trial is being determined. In trials that compare a new treatment to an old one, the ‘null’ hypothesis usually states that the new treatment is not better than the old, while the alternative states that it is. The specific alternative value chosen might be suggested by pilot studies or uncontrolled trials that preceded the experiment that is now being planned, and the sample size is determined [by calculating power] with α = 0.05 and β = 0.20. Why is such a large value of β acceptable? Why the severe asymmetry in favor of α? Sometimes, of course, a Type I error would be much more costly than a Type II error would be (e.g. if the new treatment is much more expensive, or if it entails greater discomfort). But sometimes the opposite is true, and we never see studies proposed with α = 0.20 and β = 0.05. No one is satisfied to report that ‘the new treatment is statistically significantly better than the old (p ≤ 0.20)’.

Often the sample-size calculation is first made with β = α = 0.05. But in that case experimenters are usually quite disappointed to see what large values of n are required, especially in trials with binomial (success/failure) outcomes. They next set their sights a bit lower, with α = 0.05 and β = 0.10, and find that n is still ‘too large’. Finally they settle for α = 0.05 and β = 0.20.

Why do they not adjust α and settle for α = 0.20 and β = 0.05? Why is small α a non-negotiable demand, while small β is only a flexible desideratum? A large α would seem to be scientifically unacceptable, indicating a lack of rigor, while a large β is merely undesirable, an unfortunate but sometimes unavoidable consequence of the fact that observations are expensive or that subjects eligible for the trial are hard to find and recruit. We might have to live with a large β, but good science seems to demand that α be small.

What is happening is that the formal Neyman-Pearson machinery is being used, but it is being given a rejection-trial interpretation (Emphasis added). The quantities α and β are not just the respective probabilities of choosing one hypothesis when the other is true; if they were, then calling the first hypothesis H2 and the second H1 would reverse the roles of α and β, and α = 0.20, β = 0.05 would be just as satisfactory for the problem in its new formulation as α = 0.05 and β = 0.20 were in the old one. The asymmetry arises because the quantity α is being used in the dual roles that it plays in rejection trials — it is both the probability of rejecting a hypothesis when that hypothesis is true and the measure of strength of the evidence needed to justify rejection. Good science demands small α because small α is supposed to mean strong evidence. On the other hand, the Type II error probability β is being interpreted simply as the probability of failing to find strong evidence against H1 when the alternative H2 is true (Emphasis added. Recall Fisher’s quote about the impossibility of making Type II errors since we never accept the null.) … When observations are expensive or difficult to obtain we might indeed have to live with a large probability of failure to find strong evidence. In fact, when the expense or difficulty is extreme, we often decide not to do the experiment at all, thereby accepting values of α = 0 and β = [1].

— End excerpt.

So there we have our confusion, which I alluded to in the previous post. We are imposing rejection-trial reasoning onto the Neyman-Pearson decision framework. We accept a huge β because we interpret our results as a mere failure (to produce strong enough evidence) to reject the null, when really our results imply a decision to accept the ‘null’. Remember, with NP we are always forced to choose between two hypotheses — we can never abstain from this choice because the respective rejection regions for H1 and H2 encompass the entire sample space by definition; that is, any result obtained must fall into one of the rejection regions we’ve defined. We can adjust either α or β (before starting the experiment) as we see fit, based on the relative costs of these errors. Since neither hypothesis is inherently special, adjusting α is as justified as adjusting β and neither has any bearing on the strength of evidence from our experiment.
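To make the symmetry concrete, here is a rough Python sketch of the sample-size arithmetic behind Royall's story. It assumes a one-sided comparison of two proportions using the standard unpooled normal approximation, and the success rates (0.50 for the old treatment, 0.60 for the new one) are hypothetical values chosen only for illustration; nothing here comes from Royall's book.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha, beta, p_old=0.50, p_new=0.60):
    """Approximate per-group n for a one-sided comparison of two proportions.

    Unpooled normal approximation:
        n = (z_{1-alpha} + z_{1-beta})^2 * [p_old(1-p_old) + p_new(1-p_new)] / (p_new - p_old)^2
    p_old and p_new are hypothetical illustration values, not taken from the text.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)
    variance = p_old * (1 - p_old) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_old) ** 2)


# The sequence of compromises Royall describes, plus the swap nobody proposes.
designs = [(0.05, 0.05), (0.05, 0.10), (0.05, 0.20), (0.20, 0.05)]
sizes = {design: n_per_group(*design) for design in designs}
for (alpha, beta), n in sizes.items():
    print(f"alpha = {alpha:.2f}, beta = {beta:.2f} -> n per group = {n}")
```

Under this approximation the required n depends only on the sum of the two z values, so the (α = 0.05, β = 0.20) design and the (α = 0.20, β = 0.05) design cost exactly the same number of observations. The arithmetic gives no reason to prefer one over the other; the preference comes from the evidential reading we attach to α.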

And surely it doesn’t matter which hypothesis is defined as the null, because then we would just switch the respective α and β — that is, H1 and H2 can be reversed without any penalty in the NP framework. Who cares which hypothesis gets the label 1 or 2?

But imagine the outrage (and snarky blog posts) if we tried swapping out the null hypothesis with our pet hypothesis in a rejection trial. Would anybody buy it if we tried to accept our pet hypothesis simply based on a failure to reject it? Of course not, because that would be absurd. Failing to find strong evidence against a single hypothesis has no logical implication that we have found evidence for that hypothesis. Fisher was right about this one. And this is yet another reason NP procedures and rejection trials don’t mix.

However, when we are using concepts of power and Type II errors, we are working with NP procedures which are completely symmetrical and have no concept of strength of evidence per se. Failure to reject the null hypothesis has the exact same meaning as accepting the null hypothesis — they are simply different ways to say the same thing.  If what you want is to measure evidence, fine; I think we should be measuring evidence in any case. But then you don’t have a relevant concept of power, as Fisher has reiterated time and time again. If you want to use power to help plan experiments (as seems to be recommended just about everywhere you look) then you must cast aside your intuitions about interpreting observations from that experiment as evidence. You must reject the rejection trial and reject notions of statistical evidence.

Or don’t, but then you’re swimming in a sea of confusion.

References

Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC press.

Are all significance tests made of the same stuff?

No! If you are like most of the sane researchers out there, you don’t spend your days and nights worrying about the nuances of different statistical concepts, especially ones as traditional as these. But there is one concept that I think we should all be aware of: P-values mean very different things to different people. Richard Royall (1997, pp. 76-7) provides a smattering of different possible interpretations and fleshes out the arguments for why these mixed interpretations are problematic (much of this post comes from his book):

In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we will say that the data on which the test is based do not provide sufficient evidence to cause rejection. (Daniel, 1991, p. 192)

A nonsignificant result does not prove that the null hypothesis is correct — merely that it is tenable — our data do not give adequate grounds for rejecting it. (Snedecor and Cochran, 1980, p. 66)

The verdict does not depend on how much more readily some other hypothesis would explain the data. We do not even start to take that question seriously until we have rejected the null hypothesis. … The statistical significance level is a statement about evidence… If it is small enough, say p = 0.001, we infer that the result is not readily explained as a chance outcome if the null hypothesis is true and we start to look for an alternative explanation with considerable assurance. (Murphy, 1985, p. 120)

If [the p-value] is small, we have two explanations — a rare event has happened, or the assumed distribution is wrong. This is the essence of the significance test argument. Not to reject the null hypothesis … means only that it is accepted for the moment on a provisional basis. (Watson, 1983)

Test of hypothesis. A procedure whereby the truth or falseness of the tested hypothesis is investigated by examining a value of the test statistic computed from a sample and then deciding to reject or accept the tested hypothesis according to whether the value falls into the critical region or acceptance region, respectively. (Remington and Schork, 1970, p. 200)

Although a ‘significant’ departure provides some degree of evidence against a null hypothesis, it is important to realize that a ‘nonsignificant’ departure does not provide positive evidence in favour of that hypothesis. The situation is rather that we have failed to find strong evidence against the null hypothesis. (Armitage and Berry, 1987, p. 96)

If that value [of the test statistic] is in the region of rejection, the decision is to reject H0; if that value is outside the region of rejection, the decision is that H0 cannot be rejected at the chosen level of significance … The reasoning behind this decision process is very simple. If the probability associated with the occurrence under the null hypothesis of a particular value in the sampling distribution is very small, we may explain the actual occurrence of that value in two ways; first we may explain it by deciding that the null hypothesis is false or, second, we may explain it by deciding that a rare and unlikely event has occurred. (Siegel and Castellan, 1988, Chapter 2)

These all mix and match three distinct viewpoints with regard to hypothesis tests: 1) Neyman-Pearson decision procedures, 2) Fisher’s p-value significance tests, and 3) Fisher’s rejection trials (I think 2 and 3 are sufficiently different to be considered separately). Mixing and matching them is inappropriate, as will be shown below. Unfortunately, they all use the same terms so this can get confusing! I’ll do my best to keep things simple.

1. Neyman-Pearson (NP) decision procedure:
Neyman describes it thusly:

The problem of testing a statistical hypothesis occurs when circumstances force us to make a choice between two courses of action: either take step A or take step B… (Neyman 1950, p. 258)

…any rule R prescribing that we take action A when the sample point … falls within a specified category of points, and that we take action B in all other cases, is a test of a statistical hypothesis. (Neyman 1950, p. 258)

The terms ‘accepting’ and ‘rejecting’ a statistical hypothesis are very convenient and well established. It is important, however, to keep their exact meaning in mind and to discard various additional implications which may be suggested by intuition. Thus, to accept a hypothesis H means only to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true. Also if the application … ‘rejects’ H, this means only that the rule prescribes action B and does not imply that we believe that H is false. (Neyman 1950, p. 259)

So what do we take from this? NP testing is about making a decision to choose H0 or H1, not about shedding light on the truth of any one hypothesis or another. We calculate a test statistic, see where it lies with regard to our predefined rejection regions, and make the corresponding decision. We can assure that we are not often wrong by defining Type I and Type II error probabilities (α and β) to be used in our decision procedure. According to this framework, a good test is one that minimizes these long-run error probabilities. It is important to note that this procedure cannot tell us anything about the truth of hypotheses and does not provide us with a measure of evidence of any kind, only a decision to be made according to our criteria. This procedure is notably symmetric — that is, we can either choose H0 or H1.

Test results would look like this:

“α and β were prespecified (based on relevant costs associated with the different errors) for this situation at yadda yadda yadda. The test statistic (say, t = 2.5) falls inside the rejection region for H0 defined as t > 2.0, so we reject H0 and accept H1.” (Alternatively, you might see “p < α = x, so we reject H0.” The exact value of p is irrelevant; it is either inside or outside of the rejection region defined by α. Obtaining p = .04 is effectively equivalent to obtaining p = .001 for this procedure, as is obtaining a result very much larger than the critical t above.)
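The rule in that example is simple enough to write down directly. The following is only a sketch: the function name and the list of test statistics are made up for illustration, and the critical value of 2.0 is the one used in the example above.

```python
def np_decision(t, critical_t=2.0):
    """Return the action the prespecified rule prescribes; nothing else is reported."""
    if t > critical_t:
        return "reject H0, accept H1"
    return "accept H0, reject H1"


# Hypothetical test statistics: barely inside, comfortably inside, and far inside
# the rejection region all lead to the same action.
for t in (1.2, 2.1, 2.5, 9.0):
    print(f"t = {t}: {np_decision(t)}")
```

Note that the function returns one of two actions and nothing more. How far t lands beyond the critical value never enters the output, which is exactly why the procedure cannot serve as a measure of evidence.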

2. Fisher’s p-value significance tests

Fisher’s first procedure is only ever concerned with one hypothesis, that being the null. This procedure is not concerned with making decisions (and when in science do we actually ever do that anyway?) but with measuring evidence against the hypothesis. We want to evaluate ‘the strength of evidence against the hypothesis’ (Fisher, 1958, p. 80) by evaluating how rare our particular result (or even bigger results) would be if there were really no effect in the study. Our objective here is to calculate a single number that Fisher called the level of significance, or the p-value. Smaller p is more evidence against the hypothesis than larger p. Increasing levels of significance* are often represented** by more asterisks*** in tables or graphs. More asterisks mean lower p-values, and presumably more evidence against the null.

What is the rationale behind this test? There are only two possible interpretations of our low p: either a rare event has occurred, or the underlying hypothesis is false. Fisher doesn’t think the former is reasonable, so we should assume the latter (Bakan, 1966).

Note that this procedure is directly trying to measure the truth value of a hypothesis. Lower ps indicate more evidence against the hypothesis. This is based on the Law of Improbability, that is,

Law of Improbability: If hypothesis A implies that the probability that a random variable X takes on the value x is quite small, say p(x), then the observation X = x is evidence against A, and the smaller p(x), the stronger the evidence. (Royall, 1997, p. 65)

In a future post I will attempt to show why this law is not a valid indicator of evidence. For the purpose of this post we just need to understand the logic behind this test and that it is fundamentally different from NP procedures. This test alone does not provide any guidance with regard to taking action or making a decision, it is intended as a measure of evidence against a hypothesis.

Test results would look like this:

The present results obtain a t value of 2.5, which corresponds to an observed p = .01**. This level of significance is very small and indicates quite strong evidence against the hypothesis of no difference.
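For a sense of how such a number is produced, here is a small Python sketch. It swaps the t test for an exact one-sided binomial test of a fair coin, purely because that can be computed with the standard library; the head counts and the `stars` helper are hypothetical illustrations, not anything from Fisher or Royall.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: probability of `heads` or more heads if the null is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


def stars(p):
    """Conventional asterisk labels for the 'magic' levels .05, .01, and .001."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical outcomes of 100 flips: more heads means a smaller p and more asterisks.
for heads in (55, 60, 65, 70):
    p = binomial_p_value(heads, 100)
    print(f"{heads} heads in 100 flips: p = {p:.4f} {stars(p)}")
```

Unlike the NP rule, the output here is a graded quantity: the p-value itself is the report, and it is read as the strength of evidence against the single hypothesis under test.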

3. Fisher’s rejection trials

This is a strange twist on both of the other procedures above, taking elements from each to form a rejection trial. This test is a decision procedure, much like NP procedures, but with only one explicitly defined hypothesis, a la p-value significance tests. The test is most like what psychologists actually use today, framed as two possible decisions, again like NP, but now they are framed in terms of only one hypothesis. Rejection regions are back too, defined as a region of values that have small probability under H0 (i.e., defined by a small α). It is framed as a problem of logic, specifically,

…a process analogous to testing a proposition in formal logic via the argument known as modus tollens, or ‘denying the consequent’: if A implies B, then not-B implies not-A. We can test A by determining whether B is true. If B is false, then we conclude that A is false. But, on the other hand, if B is found to be true we cannot conclude that A is true. That is, A can be proven false by such a test but it cannot be proven true — either we disprove A or we fail to disprove it…. When B is found to be true, so that A survives the test, this result, although not proving A, does seem intuitively to be evidence supporting A. (Royall, 1997, p. 72)

An important caveat is that these tests are probabilistic in nature, so the logical implications aren’t quite right. Nevertheless, rejection trials are what Fisher referred to when he famously said,

Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis… The notion of an error of the so-called ‘second kind,’ due to accepting the null hypothesis ‘when it is false’ … has no meaning with reference to simple tests of significance. (Fisher, 1966)

So there is a major difference from NP: with rejection trials you have a single hypothesis (as opposed to 2) combined with decision rules of “reject H0 or do not reject H0” (as opposed to reject H0/H1 or accept H0/H1). With rejection trials we are back to making a decision. This test is asymmetric (as opposed to NP, which is symmetric); that is, we can only ever reject H0, never accept it.

While we are making decisions with rejection trials, the decisions have a different meaning than that of NP procedures. In this framework, deciding to reject H0 implies the hypothesis is “inconsistent with the data” or that the data “provide sufficient evidence to cause rejection” of the hypothesis (Royall, 1997, p.74). So rejection trials are intended to be both decision procedures and measures of evidence. Test statistics that fall into smaller α regions are considered stronger evidence, much the same way that a smaller p-value indicates more evidence against the hypothesis. For NP procedures α is simply a property of the test, and choosing a lower one has no evidential meaning per se (although see Mayo, 1996 for a 4th significance procedure — severity testing).

Test results would look like this:

The present results obtain a t = 2.5, p = .01, which is sufficiently strong evidence against H0 to warrant its rejection.

What is the takeaway?

If you aren’t aware of the difference between the three types of hypothesis testing procedures, you’ll find yourself jumbling them all up (Gigerenzer, 2004). If you aren’t careful, you may end up thinking you have a measure of evidence when you actually have a guide to action.

Which one is correct?

Funnily enough, I don’t endorse any of them. I contend that p-values never measure evidence (in either p-value procedures or rejection trials) and NP procedures lead to absurdities that I can’t in good faith accept while simultaneously endorsing them.

Why write 2000 words clarifying the nuanced differences between three procedures I think are patently worthless? Well, did you see what I said at the top referring to sane researchers?

A future post is coming that will explicate the criticisms of each procedure, many of the points again coming from Royall’s book.

References

Armitage, P., & Berry, G. (1987). Statistical methods in medical research. Oxford: Blackwell Scientific.

Bakan, D. (1966). The test of significance in psychological research. Psychological bulletin, 66(6), 423.

Daniel, W. W. (1991). Hypothesis testing. Biostatistics: a foundation for analysis in the health sciences, 5, 191.

Fisher, R. A. (1958). Statistical methods for research workers (13th ed.). New York: Hafner.

Fisher, R. A. (1966). The design of experiments (8th edn.) Oliver and Boyd.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Murphy, E. A. (1985). A companion to medical statistics. Johns Hopkins University Press.

Neyman, J. (1950). First course in probability and statistics. New York: Henry Holt.

Remington, R. D., & Schork, M. A. (1970). Statistics with applications to the biological and health sciences.

Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC press.

Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioural sciences. New York: McGraw-Hill.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods. Ames: Iowa State University Press.

Watson, G. S. (1983). Hypothesis testing. Encyclopedia of Statistics in Quality and Reliability.

Can confidence intervals save psychology? Part 1

Maybe, but probably not by themselves. This post was inspired by Christian Jarrett’s recent post (you should go read it if you missed it), and the resulting Twitter discussion. This will likely develop into a series of posts on confidence intervals.

Geoff Cumming is a big proponent of replacing all hypothesis testing with CI reporting. He says we should change the goal to be precise estimation of effects using confidence intervals, with a goal of facilitating future meta-analyses. But do we understand confidence intervals? (More estimation is something I can get behind, but I think there is still room for hypothesis testing.)

In the Twitter discussion, Ryne commented, “If 95% of my CIs contain Mu, then there is .95 prob this one does [emphasis mine]. How is that wrong?” It’s wrong for the same reason Bayesian advocates dislike frequency statistics: you cannot assign probabilities to single events or parameters in that framework. The .95 probability is a property of the process of creating CIs in the long run; it is not associated with any given interval. That means you cannot make any probabilistic claims about this interval containing Mu, or, likewise, about this particular hypothesis being true.

In the frequency statistics framework, all probabilities are long-run frequencies (i.e., a proportion of times an outcome occurs out of all possible related outcomes). As such, all statements about associated probabilities must be of that nature. If a fair coin has an associated probability of 50% heads, and I flip a fair coin very many times, then in the long-run I will obtain half heads and half tails. In any given next flip there is no associated probability of heads. This flip is either heads (p(H) = 1) or tails (p(H) = 0) and we don’t know which until after we flip.¹ By assigning probabilities to single events the sense of a long-run frequency is lost (i.e., one flip is not a collective of all flips). As von Mises puts it:

Our probability theory [frequency statistics] has nothing to do with questions such as: “Is there a probability of Germany being at some time in the future involved in a war with Liberia?” (von Mises, 1957, p. 9, quoted in Oakes, 1986, p. 16)

This is why Ryne’s statement was wrong, and this is why there can be no statements of the kind, “X is the probability that these results are due to chance,”² or “There is a 50% chance that the next flip will be heads,” or “This hypothesis is probably false,” when one adopts the frequency statistics framework. All probabilities are long-run frequencies in a relevant “collective.” (Have I beaten this horse to death yet?) It’s counter-intuitive and strange that we cannot speak of any single event or parameter’s probability. But sadly we can’t in this framework, and as such, “There is .95 probability that Mu is captured by this CI,” is a vacuous statement. If you want to assign probabilities to single events and parameters come join us over in Bayesianland (we have cookies).
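A quick simulation can show what the .95 actually attaches to. This is only a sketch under assumed conditions: a normal population with a hypothetical mean of 100 and a known standard deviation of 15, samples of 25, and the simple known-sigma interval. None of these values come from the discussion above.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the sketch is reproducible

MU, SIGMA, N, REPS = 100.0, 15.0, 25, 10_000  # hypothetical population and design
z = NormalDist().inv_cdf(0.975)               # about 1.96 for a 95% interval
half_width = z * SIGMA / N**0.5               # known-sigma interval, for simplicity

hits = 0
for _ in range(REPS):
    sample_mean = fmean(random.gauss(MU, SIGMA) for _ in range(N))
    # Each individual interval either contains MU or it does not.
    hits += (sample_mean - half_width) <= MU <= (sample_mean + half_width)

coverage = hits / REPS
print(f"Proportion of intervals containing mu: {coverage:.3f}")
```

The proportion printed at the end describes the collective of intervals. Inside the loop there is no .95 anywhere: every single comparison is simply true or false, which is the frequentist point about any one interval.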

EDIT 11/17: See Ryne’s post for why he rejects the technical definition for a pragmatic definition.

Notes:

¹But don’t tell Daryl Bem that.

²Often a confused interpretation of the p-value. The correct interpretation is subtly different: “The probability of the obtained (or more extreme) results given chance.” “Given” is the key difference, because here you are assuming chance. How can an analysis assuming chance is true (i.e., p(chance) = 1) lead to a probability statement about chance being false?

References:

Cumming, G. (2013). The new statistics: Why and how. Psychological science, 0956797613504966.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. New York: Wiley.