Slides: “Bayesian statistical concepts: A gentle introduction”

I recently gave a talk in Bielefeld, Germany with the title “Bayesian statistical concepts: A gentle introduction.” I had a few people ask for the slides so I figured I would post them here. If you are a regular reader of this blog, it should all look pretty familiar. It was a mesh of a couple of my Understanding Bayes posts, combining “A look at the Likelihood” and the most recent one, “” The main goal was to give the audience an appreciation for the comparative nature of Bayesian statistical evidence, as well as demonstrate how evidence in the sample has to be interpreted in the context of the specific problem. I didn’t go into Bayes factors or posterior estimation because I promised that it would be a simple and easy talk about the basic concepts.

I’m very grateful to JP de Ruiter for inviting me out to Bielefeld to give this talk, in part because it was my first talk ever! I think it went well enough, but there are a lot of things I can improve on; both in terms of slide content and verbal presentation. JP is very generous with his compliments, and he also gave me a lot of good pointers to incorporate for the next time I talk Bayes.

The main narrative of my talk was that we were to draw candies from one of two possible bags and try to figure out which bag we were drawing from. After each of the slides where I proposed the game I had a member of the audience actually come up and play it with me. The candies, bags, and cards were real but the bets were hypothetical. It was a lot of fun. 🙂

Here is a picture JP took during the talk.

Here are the slides. (You can download a pdf copy from here.)

The Bayesian Reproducibility Project

[Edit: There is a now-published Bayesian reanalysis of the RPP. See here.]

The Reproducibility Project was finally published this week in Science, and an outpouring of articles followed. Headlines included “More Than 50% Psychology Studies Are Questionable: Study”, “Scientists Replicated 100 Psychology Studies, and Fewer Than Half Got the Same Results”, and “More than half of psychology papers are not reproducible”.

Are these categorical conclusions warranted? If you look at the paper, it makes very clear that the results do not definitively establish effects as true or false:

After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. (p. 7)

Very well said. The point of this project was not to determine what proportion of effects are “true”. The point of this project was to see what results are replicable in an independent sample. The question arises of what exactly this means. Is an original study replicable if the replication simply matches it in statistical significance and direction? The authors entertain this possibility:

A straightforward method for evaluating replication is to test whether the replication shows a statistically significant effect (P < 0.05) with the same direction as the original study. This dichotomous vote-counting method is intuitively appealing and consistent with common heuristics used to decide whether original studies “worked.” (p. 4)

How did the replications fare? Not particularly well.

Ninety-seven of 100 (97%) effects from original studies were positive results … On the basis of only the average replication power of the 97 original, significant effects [M = 0.92, median (Mdn) = 0.95], we would expect approximately 89 positive results in the replications if all original effects were true and accurately estimated; however, there were just 35 [36.1%; 95% CI = (26.6%, 46.2%)], a significant reduction … (p. 4)

So the replications, being judged on this metric, did (frankly) horribly when compared to the original studies. Only 35 of the studies achieved significance, as opposed to the 89 expected and the 97 total. This gives a success rate of either 36% (35/97) out of all studies, or 39% (35/89) relative to the number of studies expected to achieve significance based on power calculations. Either way, pretty low. These were the numbers that most of the media latched on to.

Does this metric make sense? Arguably not, since the “difference between significant and not significant is not necessarily significant” (Gelman & Stern, 2006). Comparing significance levels across experiments is not valid inference. A non-significant replication result can be entirely consistent with the original effect, and yet count as a failure because it did not achieve significance. There must be a better metric.

The authors recognize this, so they also used a metric that utilized confidence intervals over simple significance tests. Namely, does the confidence interval from the replication study include the originally reported effect? They write,

This method addresses the weakness of the first test that a replication in the same direction and a P value of 0.06 may not be significantly different from the original result. However, the method will also indicate that a replication “fails” when the direction of the effect is the same but the replication effect size is significantly smaller than the original effect size … Also, the replication “succeeds” when the result is near zero but not estimated with sufficiently high precision to be distinguished from the original effect size. (p. 4)

So with this metric a replication is considered successful if the replication result’s confidence interval contains the original effect, and fails otherwise. The replication effect can be near zero, but if the CI is wide enough it counts as a non-failure (i.e., a “success”). A replication can also be quite near the original effect but have high precision, thus excluding the original effect and “failing”.

This metric is very indirect, and their use of scare-quotes around “succeeds” is telling. Roughly 47% of confidence intervals in the replications “succeeded” in capturing the original result. The problem with this metric is obvious: Replications with effects near zero but wide CIs get the same credit as replications that were bang on the original effect (or even larger) with narrow CIs. Results that don’t flat out contradict the original effects count as much as strong confirmations? Why should both of these types of results be considered equally successful?

Based on these two metrics, the headlines are accurate: Over half of the replications “failed”. But these two reproducibility metrics are either invalid (comparing significance levels across experiments) or very vague (confidence interval agreement). They also only offer binary answers: A replication either “succeeds” or “fails”, and this binary thinking leads to absurd conclusions in some cases like those mentioned above. Is replicability really so black and white? I will explain below how I think we should measure replicability in a Bayesian way, with a continuous measure that can find reasonable answers with replication effects near zero with wide CIs, effects near the original with tight CIs, effects near zero with tight CIs, replication effects that go in the opposite direction, and anything in between.

A Bayesian metric of reproducibility

I wanted to look at the results of the reproducibility project through a Bayesian lens. This post should really be titled, “A Bayesian …” or “One Possible Bayesian …” since there is no single Bayesian answer to any question (but those titles aren’t as catchy). It depends on how you specify the problem and what question you ask. When I look at the question of replicability, I want to know if is there evidence for replication success or for replication failure, and how strong that evidence is. That is, should I interpret the replication results as more consistent with the original reported result or more consistent with a null result, and by how much?

Verhagen and Wagenmakers (2014), and Wagenmakers, Verhagen, and Ly (2015) recently outlined how this could be done for many types of problems. The approach naturally leads to computing a Bayes factor. With Bayes factors, one must explicitly define the hypotheses (models) being compared. In this case one model corresponds to a probability distribution centered around the original finding (i.e. the posterior), and the second model corresponds to the null model (effect = 0). The Bayes factor tells you which model the replication result is more consistent with, and larger Bayes factors indicate a better relative fit. So it’s less about obtaining evidence for the effect in general and more about gauging the relative predictive success of the original effects. (footnote 1)

If the original results do a good job of predicting replication results, the original effect model will achieve a relatively large Bayes factor. If the replication results are much smaller or in the wrong direction, the null model will achieve a large Bayes factor. If the result is ambiguous, there will be a Bayes factor near 1. Again, the question is which model better predicts the replication result? You don’t want a null model to predict replication results better than your original reported effect.

A key advantage of the Bayes factor approach is that it allows natural grades of evidence for replication success. A replication result can strongly agree with the original effect model, it can strongly agree with a null model, or it can lie somewhere in between. To me, the biggest advantage of the Bayes factor is it disentangles the two types of results that traditional significance tests struggle with: a result that actually favors the null model vs a result that is simply insensitive. Since the Bayes factor is inherently a comparative metric, it is possible to obtain evidence for the null model over the tested alternative. This addresses my problem I had with the above metrics: Replication results bang on the original effects get big boosts in the Bayes factor, replication results strongly inconsistent with the original effects get big penalties in the Bayes factor, and ambiguous replication results end up with a vague Bayes factor.

Bayes factor methods are often criticized for being subjective, sensitive to the prior, and for being somewhat arbitrary. Specifying the models is typically hard, and sometimes more arbitrary models are chosen for convenience for a given study. Models can also be specified by theoretical considerations that often appear subjective (because they are). For a replication study, the models are hardly arbitrary at all. The null model corresponds to that of a skeptic of the original results, and the alternative model corresponds to a strong theoretical proponent. The models are theoretically motivated and answer exactly what I want to know: Does the replication result fit more with the original effect model or a null model? Or as Verhagen and Wagenmakers (2014) put it, “Is the effect similar to what was found before, or is it absent?” (p.1458 here).

Replication Bayes factors

In the following, I take the effects reported in figure 3 of the reproducibility project (the pretty red and green scatterplot) and calculate replication Bayes factors for each one. Since they have been converted to correlation measures, replication Bayes factors can easily be calculated using the code provided by Wagenmakers, Verhagen, and Ly (2015). The authors of the reproducibility project kindly provide the script for making their figure 3, so all I did was take the part of the script that compiled the converted 95 correlation effect sizes for original and replication studies. (footnote 2) The replication Bayes factor script takes the correlation coefficients from the original studies as input, calculates the corresponding original effect’s posterior distribution, and then compares the fit of this distribution and the null model to the result of the replication. Bayes factors larger than 1 indicate the original effect model is a better fit, Bayes factors smaller than 1 indicate the null model is a better fit. Large (or really small) Bayes factors indicate strong evidence, and Bayes factors near 1 indicate a largely insensitive result.

The replication Bayes factors are summarized in the figure below (click to enlarge). The y-axis is the count of Bayes factors per bin, and the different bins correspond to various strengths of replication success or failure. Results that fall in the bins left of center constitute support the null over the original result, and vice versa. The outer-most bins on the left or right contain the strongest replication failures and successes, respectively. The bins labelled “Moderate” contain the more muted replication successes or failures. The two central-most bins labelled “Insensitive” contain results that are essentially uninformative.

So how did we do?

You’ll notice from this crude binning system that there is quite a spread from super strong replication failure to super strong replication success. I’ve committed the sin of binning a continuous outcome, but I think it serves as a nice summary. It’s important to remember that Bayes factors of 2.5 vs 3.5, while in different bins, aren’t categorically different. Bayes factors of 9 vs 11, while in different bins, aren’t categorically different. Bayes factors of 15 and 90, while in the same bin, are quite different. There is no black and white here. These are the categories Bayesians often use to describe grades of Bayes factors, so I use them since they are familiar to many readers. If you have a better idea for displaying this please leave a comment. 🙂 Check out the “Results” section at the end of this post to see a table which shows the study number, the N in original and replications, the r values of each study, the replication Bayes factor and category I gave it, and the replication p-value for comparison with the Bayes factor. This table shows the really wide spread of the results. There is also code in the “Code” section to reproduce the analyses.

Strong replication failures and strong successes

Roughly 20% (17 out of 95) of replications resulted in relatively strong replication failures (2 left-most bins), with resultant Bayes factors at least 10:1 in favor of the null. The highest Bayes factor in this category was over 300,000 (study 110, “Perceptual mechanisms that characterize gender differences in decoding women’s sexual intent”). If you were skeptical of these original effects, you’d feel validated in your skepticism after the replications. If you were a proponent of the original effects’ replicability you’ll perhaps want to think twice before writing that next grant based around these studies.

Roughly 25% (23 out of 95) of replications resulted in relatively strong replication successes (2 right-most bins), with resultant Bayes factors at least 10:1 in favor of the original effect. The highest Bayes factor in this category was 1.3×10^32 (or log(bf)=74; study 113, “Prescribed optimism: Is it right to be wrong about the future?”) If you were a skeptic of the original effects you should update your opinion to reflect the fact that these findings convincingly replicated. If you were a proponent of these effects you feel validation in that they appear to be robust.

These two types of results are the most clear-cut: either the null is strongly favored or the original reported effect is strongly favored. Anyone who was indifferent to these effects has their opinion swayed to one side, and proponents/skeptics are left feeling either validated or starting to re-evaluate their position. There was only 1 very strong (BF>100) failure to replicate but there were quite a few very strong replication successes (16!). There were approximately twice as many strong (10<BF<100) failures to replicate (16) than strong replication successes (7).

Moderate replication failures and moderate successes

The middle-inner bins are labelled “Moderate”, and contain replication results that aren’t entirely convincing but are still relatively informative (3<BF<10). The Bayes factors in the upper end of this range are somewhat more convincing than the Bayes factors in the lower end of this range.

Roughly 20% (19 out of 95) of replications resulted in moderate failures to replicate (third bin from the left), with resultant Bayes factors between 10:1 and 3:1 in favor of the null. If you were a proponent of these effects you’d feel a little more hesitant, but you likely wouldn’t reconsider your research program over these results. If you were a skeptic of the original effects you’d feel justified in continued skepticism.

Roughly 10% (9 out of 95) of replications resulted in moderate replication successes (third bin from the right), with resultant Bayes factors between 10:1 and 3:1 in favor of the original effect. If you were a big skeptic of the original effects, these replication results likely wouldn’t completely change your mind (perhaps you’d be a tad more open minded). If you were a proponent, you’d feel a bit more confident.

Many uninformative “failed” replications

The two central bins contain replication results that are insensitive. In general, Bayes factors smaller than 3:1 should be interpreted only as very weak evidence. That is, these results are so weak that they wouldn’t even be convincing to an ideal impartial observer (neither proponent nor skeptic). These two bins contain 27 replication results. Approximately 30% of the replication results from the reproducibility project aren’t worth much inferentially!

A few examples:

• Study 2, “Now you see it, now you don’t: repetition blindness for nonwords” BF = 2:1 in favor of null
• Study 12, “When does between-sequence phonological similarity promote irrelevant sound disruption?” BF = 1.1:1 in favor of null
• Study 80, “The effects of an implemental mind-set on attitude strength.” BF = 1.2:1 in favor of original effect
• Study 143, “Creating social connection through inferential reproduction: Loneliness and perceived agency in gadgets, gods, and greyhounds” BF = 2:1 in favor of null

I just picked these out randomly. The types of replication studies in this inconclusive set range from attentional blink (study 2), to brain mapping studies (study 55), to space perception (study 167), to cross national comparisons of personality (study 154).

Should these replications count as “failures” to the same extent as the ones in the left 2 bins? Should studies with a Bayes factor of 2:1 in favor of the original effect count as “failures” as much as studies with 50:1 against? I would argue they should not, they should be called what they are: entirely inconclusive.

Interestingly, study 143 mentioned above was recently called out in this NYT article as a high-profile study that “didn’t hold up”. Actually, we don’t know if it held up! Identifying replications that were inconclusive using this continuous range helps avoid over-interpreting ambiguous results as “failures”.

Wrap up

To summarize the graphic and the results discussed above, this method identifies roughly as many replications with moderate success or better (BF>3) as the counting significance method (32 vs 35). (footnote 3) These successes can be graded based on their replication Bayes factor as moderate to very strong. The key insight from using this method is that many replications that “fail” based on the significance count are actually just inconclusive. It’s one thing to give equal credit to two replication successes that are quite different in strength, but it’s another to call all replications failures equally bad when they show a highly variable range. Calling a replication a failure when it is actually inconclusive has consequences for the original researcher and the perception of the field.

As opposed to the confidence interval metric, a replication effect centered near zero with a wide CI will not count as a replication success with this method; it would likely be either inconclusive or weak evidence in favor of the null. Some replications are indeed moderate to strong failures to replicate (36 or so), but nearly 30% of all replications in the reproducibility project (27 out of 95) were not very informative in choosing between the original effect model and the null model.

So to answer my question as I first posed it, are the categorical conclusions of wide-scale failures to replicate by the media stories warranted? As always, it depends.

• If you count “success” as any Bayes factor that has any evidence in favor of the original effect (BF>1), then there is a 44% success rate (42 out of 95).
• If you count “success” as any Bayes factor with at least moderate evidence in favor of the original effect (BF>3), then there is a 34% success rate (32 out of 95).
• If you count  “failure” as any Bayes factor that has at least moderate evidence in favor of the null (BF<1/3), then there is a 38% failure rate (36 out of 95).
• If you only consider the effects sensitive enough to discriminate the null model and the original effect model (BF>3 or BF<1/3) in your total, then there is a roughly 47% success rate (32 out of 68). This number jives (uncannily) well with the prediction John Ioannidis made 10 years ago (47%).

However you judge it, the results aren’t exactly great.

But if we move away from dichotomous judgements of replication success/failure, we see a slightly less grim picture. Many studies strongly replicated, many studies strongly failed, but many studies were in between. There is a wide range! Judgements of replicability needn’t be black and white. And with more data the inconclusive results could have gone either way.  I would argue that any study with 1/3<BF<3 shouldn’t count as a failure or a success, since the evidence simply is not convincing; I think we should hold off judging these inconclusive effects until there is stronger evidence. Saying “we didn’t learn much about this or that effect” is a totally reasonable thing to do. Boo dichotomization!

Try out this method!

All in all, I think the Bayesian approach to evaluating replication success is advantageous in 3 big ways: It avoids dichotomizing replication outcomes, it gives an indication of the range of the strength of replication successes or failures, and it identifies which studies we need to give more attention to (insensitive BFs). The Bayes factor approach used here can straighten out when a replication shows strong evidence in favor of the null model, strong evidence in favor of the original effect model, or evidence that isn’t convincingly in favor of either position. Inconclusive replications should be targeted for future replication, and perhaps we should look into why these studies that purport to have high power (>90%) end up with insensitive results (large variance, design flaw, overly optimistic power calcs, etc). It turns out that having high power in planning a study is no guarantee that one actually obtains convincingly sensitive data (Dienes, 2014; Wagenmakers et al., 2014).

I should note, the reproducibility project did try to move away from the dichotomous thinking about replicability by correlating the converted effect sizes (r) between original and replication studies. This was a clever idea, and it led to a very pretty graph (figure 3) and some interesting conclusions. That idea is similar in spirit to what I’ve laid out above, but its conclusions can only be drawn from batches of replication results. Replication Bayes factors allow one to compare the original and replication results on an effect by effect basis. This Bayesian method can grade a replication on its relative success or failure even if your reproducibility project only has 1 effect in it.

I should also note, this analysis is inherently context dependent. A different group of studies could very well show a different distribution of replication Bayes factors, where each individual study has a different prior distribution (based on the original effect). I don’t know how much these results would generalize to other journals or other fields, but I would be interested to see these replication Bayes factors employed if systematic replication efforts ever do catch on in other fields.

Acknowledgements and thanks

The authors of the reproducibility project have done us all a great service and I am grateful that they have shared all of their code, data, and scripts. This re-analysis wouldn’t have been possible without their commitment to open science. I am also grateful to EJ Wagenmakers, Josine Verhagen, and Alexander Ly for sharing the code to calculate the replication Bayes factors on the OSF. Many thanks to Chris Engelhardt and Daniel Lakens for some fruitful discussions when I was planning this post. Of course, the usual disclaimer applies and all errors you find should be attributed only to me.

Notes

footnote 1: Of course, a model that takes publication bias into account could fit better by tempering the original estimate, and thus show relative evidence for the bias-corrected effect vs either of the other models; but that’d be answering a different question than the one I want to ask.

footnote 2: I left out 2 results that I couldn’t get to work with the calculations. Studies 46 and 139, both appear to be fairly strong successes, but I’ve left them out of the reported numbers because I couldn’t calculate a BF.

footnote 3: The cutoff of BF>3 isn’t a hard and fast rule at all. Recall that this is a continuous measure. Bayes factors are typically a little more conservative than significance tests in supporting the alternative hypothesis. If the threshold for success is dropped to BF>2 the number of successes is 35 — an even match with the original estimate.

Results

This table is organized from smallest replication Bayes factor to largest (i.e., strongest evidence in favor of null to strongest evidence in favor of original effect). The Ns were taken from the final columns in the master data sheet,”T_N_O_for_tables” and “T_N_R_for_tables”. Some Ns are not integers because they presumably underwent df correction. There is also the replication p-value for comparison; notice that BFs>3 generally correspond to ps less than .05 — BUT there are some cases where they do not agree. If you’d like to see more about the studies you can check out the master data file in the reproducibility project OSF page (linked below).

R Code

If you want to check/modify/correct my code, here it is. If you find a glaring error please leave a comment below or tweet at me 🙂

References

Link to the reproducibility project OSF

Link to replication Bayes factors OSF

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in psychology, 5.

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328-331.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 28 August 2015: 349 (6251), aac4716 [DOI:10.1126/science.aac4716]

Verhagen, J., & Wagenmakers, E. J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General,143(4), 1457.

Wagenmakers, E. J., Verhagen, A. J., & Ly, A. (in press). How to quantify the evidence for the absence of a correlation. Behavior Research Methods.

Wagenmakers, E. J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., … & Morey, R. D. (2014). A power fallacy. Behavior research methods, 1-5.

Understanding Bayes: Visualization of the Bayes Factor

In the first post of the Understanding Bayes series I said:

The likelihood is the workhorse of Bayesian inference. In order to understand Bayesian parameter estimation you need to understand the likelihood. In order to understand Bayesian model comparison (Bayes factors) you need to understand the likelihood and likelihood ratios.

I’ve shown in another post how the likelihood works as the updating factor for turning priors into posteriors for parameter estimation. In this post I’ll explain how using Bayes factors for model comparison can be conceptualized as a simple extension of likelihood ratios.

There’s that coin again

Imagine we’re in a similar situation as before: I’ve flipped a coin 100 times and it came up 60 heads and 40 tails. The likelihood function for binomial data in general is:

$\ P \big(X = x \big) \propto \ p^x \big(1-p \big)^{n-x}$

and for this particular result:

$\ P \big(X = 60 \big) \propto \ p^{60} \big(1-p \big)^{40}$

The corresponding likelihood curve is shown below, which displays the relative likelihood for all possible simple (point) hypotheses given this data. Any likelihood ratio can be calculated by simply taking the ratio of the different hypotheses’s heights on the curve.

In that previous post I compared the fair coin hypothesis — H0: P(H)=.5 — vs one particular trick coin hypothesis — H1: P(H)=.75. For 60 heads out of 100 tosses, the likelihood ratio for these hypotheses is L(.5)/L(.75) = 29.9. This means the data are 29.9 times as probable under the fair coin hypothesis than this particular trick coin hypothesisBut often we don’t have theories precise enough to make point predictions about parameters, at least not in psychology. So it’s often helpful if we can assign a range of plausible values for parameters as dictated by our theories.

Enter the Bayes factor

Calculating a Bayes factor is a simple extension of this process. A Bayes factor is a weighted average likelihood ratio, where the weights are based on the prior distribution specified for the hypotheses. For this example I’ll keep the simple fair coin hypothesis as the null hypothesis — H0: P(H)=.5 — but now the alternative hypothesis will become a composite hypothesis — H1: P(θ). (footnote 1) The likelihood ratio is evaluated at each point of P(θ) and weighted by the relative plausibility we assign that value. Then once we’ve assigned weights to each ratio we just take the average to get the Bayes factor. Figuring out how the weights should be assigned (the prior) is the tricky part.

Imagine my composite hypothesis, P(θ), is a combination of 21 different point hypotheses, all evenly spaced out between 0 and 1 and all of these points are weighted equally (not a very realistic hypothesis!). So we end up with P(θ) = {0, .05, .10, .15, . . ., .9, .95, 1}. The likelihood ratio can be evaluated at every possible point hypothesis relative to H0, and we need to decide how to assign weights. This is easy for this P(θ); we assign zero weight for every likelihood ratio that is not associated with one of the point hypotheses contained in P(θ), and we assign weights of 1 to all likelihood ratios associated with the 21 points in P(θ).

This gif has the 21 point hypotheses of P(θ) represented as blue vertical lines (indicating where we put our weights of 1), and the turquoise tracking lines represent the likelihood ratio being calculated at every possible point relative to H0: P(H)=.5. (Remember, the likelihood ratio is the ratio of the heights on the curve.) This means we only care about the ratios given by the tracking lines when the dot attached to the moving arm aligns with the vertical P(θ) lines. [edit: this paragraph added as clarification]

The 21 likelihood ratios associated with P(θ) are:

{~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, .002, .08, 1, 4.5, 7.5, 4.4, .78, .03, ~0, ~0, ~0, ~0, ~0}

Since they are all weighted equally we simply average, and obtain BF = 18.3/21 = .87. In other words, the data (60 heads out of 100) are 1/.87 = 1.15 times more probable under the null hypothesis — H0: P(H)=.5 — than this particular composite hypothesis — H1: P(θ). Entirely uninformative! Despite tossing the coin 100 times we have extremely weak evidence that is hardly worth even acknowledging. This happened because much of P(θ) falls in areas of extremely low likelihood relative to H0, as evidenced by those 13 zeros above. P(θ) is flexible, since it covers the entire possible range of θ, but this flexibility comes at a price. You have to pay for all of those zeros with a lower weighted average and a smaller Bayes factor.

Now imagine I had seen a trick coin like this before, and I know it had a slight bias towards landing heads. I can use this information to make more pointed predictions. Let’s say I define P(θ) as 21 equally weighted point hypotheses again, but this time they are all equally spaced between .5 and .75, which happens to be the highest density region of the likelihood curve (how fortuitous!). Now P(θ) = {.50, .5125, .525, . . ., .7375, .75}.

The 21 likelihood ratios associated with the new P(θ) are:

{1.00, 1.5, 2.1, 2.8, 4.5, 5.4, 6.2, 6.9, 7.5, 7.3, 6.9, 6.2, 4.4, 3.4, 2.6, 1.8, .78, .47, .27, .14, .03}

They are all still weighted equally, so the simple average is BF = 72/21 = 3.4. Three times more informative than before, and in favor of P(θ) this time! And no zeros. We were able to add theoretically relevant information to H1 to make more accurate predictions, and we get rewarded with a Bayes boost. (But this result is only 3-to-1 evidence, which is still fairly weak.)

This new P(θ) is risky though, because if the data show a bias towards tails or a more extreme bias towards heads then it faces a very heavy penalty (many more zeros). High risk = high reward with the Bayes factor. Make pointed predictions that match the data and get a bump to your BF, but if you’re wrong then pay a steep price. For example, if the data were 60 tails instead of 60 heads the BF would be 10-to-1 against P(θ) rather than 3-to-1 for P(θ)!

Now, typically people don’t actually specify hypotheses like these. Typically they use continuous distributions, but the idea is the same. Take the likelihood ratio at each point relative to H0, weigh according to plausibilities given in P(θ), and then average.

A more realistic (?) example

Imagine you’re walking down the sidewalk and you see a shiny piece of foreign currency by your feet. You pick it up and want to know if it’s a fair coin or an unfair coin. As a Bayesian you have to be precise about what you mean by fair and unfair. Fair is typically pretty straightforward — H0: P(H)=.5 as before — but unfair could mean anything. Since this is a completely foreign coin to you, you may want to be fairly open-minded about it. After careful deliberation, you assign P(θ) a beta distribution, with shape parameters 10 and 10. That is, H1: P(θ) ~ Beta(10, 10). This means that if the coin isn’t fair, it’s probably close to fair but it could reasonably be moderately biased, and you have no reason to think it is particularly biased to one side or the other.

Now you build a perfect coin-tosser machine and set it to toss 100 times (but not any more than that because you haven’t got all day). You carefully record the results and the coin comes up 33 heads out of 100 tosses. Under which hypothesis are these data more probable, H0 or H1? In other words, which hypothesis did the better job predicting these data?

This may be a continuous prior but the concept is exactly the same as before: weigh the various likelihood ratios based on the prior plausibility assignment and then average. The continuous distribution on P(θ) can be thought of as a set of many many point hypotheses spaced very very close together. So if the range of θ we are interested in is limited to 0 to 1, as with binomials and coin flips, then a distribution containing 101 point hypotheses spaced .01 apart, can effectively be treated as if it were continuous. The numbers will be a little off but all in all it’s usually pretty close. So imagine that instead of 21 hypotheses you have 101, and their relative plausibilities follow the shape of a Beta(10, 10). (footnote 2)

Since this is not a uniform distribution, we need to assign varying weights to each likelihood ratio. Each likelihood ratio associated with a point in P(θ) is simply multiplied by the respective density assigned to it under P(θ). For example, the density of P(θ) at .4 is 2.44. So we multiply the likelihood ratio at that point, L(.4)/L(.5) = 128, by 2.44, and add it to the accumulating total likelihood ratio. Do this for every point and then divide by the total number of points, in this case 101, to obtain the approximate Bayes factor. The total weighted likelihood ratio is 5564.9, divide it by 101 to get 55.1, and there’s the Bayes factor. In other words, the data are roughly 55 times more probable under this composite H1 than under H0. The alternative hypothesis H1 did a much better job predicting these data than did the null hypothesis H0.

The actual Bayes factor is obtained by integrating the likelihood with respect to H1’s density distribution and then dividing by the (marginal) likelihood of H0. Essentially what it does is cut P(θ) into slices infinitely thin before it calculates the likelihood ratios, re-weighs, and averages. That Bayes factor comes out to 55.7, which is basically the same thing we got through this ghetto visualization demonstration!

Take home

The take-home message is hopefully pretty clear at this point: When you are comparing a point null hypothesis with a composite hypothesis, the Bayes factor can be thought of as a weighted average of every point hypothesis’s likelihood ratio against H0, and the weights are determined by the prior density distribution of H1. Since the Bayes factor is a weighted average based on the prior distribution, it’s really important to think hard about the prior distribution you choose for H1. In a previous post, I showed how different priors can converge to the same posterior with enough data. The priors are often said to “wash out” in estimation problems like that. This is not necessarily the case for Bayes factors. The priors you choose matter, so think hard!

Notes

Footnote 1: A lot of ink has been spilled arguing about how one should define P(θ). I talked about it a little a previous post.

Footnote 2: I’ve rescaled the likelihood curve to match the scale of the prior density under H1. This doesn’t affect the values of the Bayes factor or likelihood ratios because the scaling constant cancels itself out.

Understanding Bayes: Updating priors via the likelihood

[Some material from this post has been incorporated into a paper to be published in AMPPS]

In a previous post I outlined the basic idea behind likelihoods and likelihood ratios. Likelihoods are relatively straightforward to understand because they are based on tangible data. Collect your data, and then the likelihood curve shows the relative support that your data lend to various simple hypotheses. Likelihoods are a key component of Bayesian inference because they are the bridge that gets us from prior to posterior.

In this post I explain how to use the likelihood to update a prior into a posterior. The simplest way to illustrate likelihoods as an updating factor is to use conjugate distribution families (Raiffa & Schlaifer, 1961). A prior and likelihood are said to be conjugate when the resulting posterior distribution is the same type of distribution as the prior. This means that if you have binomial data you can use a beta prior to obtain a beta posterior. If you had normal data you could use a normal prior and obtain a normal posterior. Conjugate priors are not required for doing bayesian updating, but they make the calculations a lot easier so they are nice to use if you can.

I’ll use some data from a recent NCAA 3-point shooting contest to illustrate how different priors can converge into highly similar posteriors.

The data

This year’s NCAA shooting contest was a thriller that saw Cassandra Brown of the Portland Pilots win the grand prize. This means that she won the women’s contest and went on to defeat the men’s champion in a shoot-off. This got me thinking, just how good is Cassandra Brown?

What a great chance to use some real data in a toy example. She completed 4 rounds of shooting, with 25 shots in each round, for a total of 100 shots (I did the math). The data are counts, so I’ll be using the binomial distribution as a data model (i.e., the likelihood. See this previous post for details). Her results were the following:

Round 1: 13/25               Round 2: 12/25               Round 3: 14/25               Round 4: 19/25

Total: 58/100

The likelihood curve below encompasses the entirety of statistical evidence that our 3-point data provide (footnote 1). The hypothesis with the most relative support is .58, and the curve is moderately narrow since there are quite a few data points. I didn’t standardize the height of the curve in order to keep it comparable to the other curves I’ll be showing.

The prior

Now the part that people often make a fuss about: choosing the prior. There are a few ways to choose a prior. Since I am using a binomial likelihood, I’ll be using a conjugate beta prior. A beta prior has two shape parameters that determine what it looks like, and is denoted Beta(α, β). I like to think of priors in terms of what kind of information they represent. The shape parameters α and β can be thought of as prior observations that I’ve made (or imagined).

Imagine my trusted friend caught the end of Brown’s warm-up and saw her take two shots, making one and missing the other, and she tells me this information. This would mean I could reasonably use the common Beta(1, 1) prior, which represents a uniform density over [0, 1]. In other words, all possible values for Brown’s shooting percentage are given equal weight before taking data into account, because the only thing I know about her ability is that both outcomes are possible (Lee & Wagenmakers, 2005).

Another common prior is called Jeffreys’s prior, a Beta(1/2, 1/2) which forms a wide bowl shape. This prior would be recommended if you had extremely scarce information about Brown’s ability. Is Brown so good that she makes nearly every shot, or is she so bad that she misses nearly every shot? This prior says that Brown’s shooting rate is probably near the extremes, which may not necessarily reflect a reasonable belief for someone who is a college basketball player, but it has the benefit of having less influence on the posterior estimates than the uniform prior (since it is equal to 1 prior observation instead of 2). Jeffreys’s prior is popular because it has some desirable properties, such as invariance under parameter transformation (Jaynes, 2003). So if instead of asking about Brown’s shooting percentage I instead wanted to know her shooting percentage squared or cubed, Jeffreys’s prior would remain the same shape while many other priors would drastically change shape.

Or perhaps I had another trusted friend who had arrived earlier and seen Brown take her final 13 shots in warm-up, and she saw 4 makes and 9 misses. Then I could use a Beta(4, 9) prior to characterize this prior information, which looks like a hump over .3 with density falling slowly as it moves outward in either direction. This prior has information equivalent to 13 shots, or roughly an extra 1/2 round of shooting.

These three different priors are shown below.

These are but three possible priors one could use. In your analysis you can use any prior you want, but if you want to be taken seriously you’d better give some justification for it. Bayesian inference allows many rules for prior construction.”This is my personal prior” is a technically a valid reason, but if this is your only justification then your colleagues/reviewers/editors will probably not take your results seriously.

Updating the prior via the likelihood

Now for the easiest part. In order to obtain a posterior, simply use Bayes’s rule:

$\ Posterior \propto Likelihood \ X \ Prior$

The posterior is proportional to the likelihood multiplied by the prior. What’s nice about working with conjugate distributions is that Bayesian updating really is as simple as basic algebra. We take the formula for the binomial likelihood, which from a previous post is known to be:

$\ Likelihood \ = \ p^x \big(1-p \big)^{n-x}$

and then multiply it by the formula for the beta prior with α and β shape parameters:

$\ Prior \ = \ p^{\alpha-1} \big(1-p \big)^{\beta-1}$

to obtain the following formula for the posterior:

$\ Posterior \ = \ p^x \big(1-p \big)^{n-x} p^{\alpha-1} \big(1-p \big)^{\beta-1}$

With a little bit of algebra knowledge, you’ll know that multiplying together terms with the same base means the exponents can be added together. So the posterior formula can be rewritten as:

$\ Posterior \ = \ p^x p^{\alpha-1}\big(1-p \big)^{n-x} \big(1-p \big)^{\beta-1}$

and then by adding the exponents together the formula simplifies to:

$\ Posterior \ = \ p^{\alpha-1+x} \big(1-p \big)^{\beta-1+n-x}$

and it’s that simple! Take the prior, add the successes and failures to the different exponents, and voila. The distributional notation is even simpler. Take the prior, Beta(α, β), and add the successes from the data, x, to α and the failures, n – x, to β, and there’s your posterior, Beta(α+x, β+n-x).

Remember from the previous post that likelihoods don’t care about what order the data arrive in, it always results in the same curve. This property of likelihoods is carried over to posterior updating. The formulas above serve as another illustration of this fact. It doesn’t matter if you add a string of six single data points, 1+1+1+1+1+1+1 or a batch of +6 data points; the posterior formula in either case ends up with 6 additional points in the exponents.

Looking at some posteriors

Back to Brown’s shooting data. She had four rounds of shooting so I’ll treat each round as a batch of new data. Her results for each round were: 13/25, 12/25, 14/25, 19/25. I’ll show how the different priors are updated with each batch of data. A neat thing about bayesian updating is that after batch 1 is added to the initial prior, its posterior is used as the prior for the next batch of data. And as the formulas above indicate, the order or frequency of additions doesn’t make a difference on the final posterior. I’ll verify this at the end of the post.

In the following plots, the prior is shown in blue (as above), the likelihood in orange (as above), and the resulting posteriors after Brown’s first 13/25 makes in purple.

In the first and second plot the likelihood is nearly invisible because the posterior sits right on top of it. When the prior has only 1 or 2 data points worth of information, it has essentially no impact on the posterior shape (footnote 2). The third plot shows how the posterior splits the difference between the likelihood and the informed prior based on the relative quantity of information in each.

The posteriors obtained from the uniform and Jeffreys’s priors suggest the best guess for Brown’s shooting percentage is around 50%, whereas the posterior obtained from the informed prior suggests it is around 40%. No surprise here since the informed prior represents another 1/2 round of shots where Brown performed poorly, which shifts the posterior towards lower values. But all three posteriors are still quite broad, and the breadth of the curves can be thought to represent the uncertainty in my estimates. More data -> tighter curves -> less uncertainty.

Now I’ll add the second round performance as a new likelihood (12/25 makes), and I’ll take the posteriors from the first round of updating as new priors for the second round of updating. So the purple posteriors from the plots above are now blue priors, the likelihood is orange again, and the new posteriors are purple.

The left two plots look nearly identical, which should be no surprise since their posteriors were essentially equivalent after only 1 round of data updates. The third plot shows a posterior still slightly shifted to the left of the others, but it is much more in line with them than before. All three posteriors are getting narrower as more data is added.

The last two rounds of updating are shown below, again with posteriors from the previous round taken as priors for the next round. At this point they’ve all converged to very similar posteriors that are much narrower, translating to less uncertainty in my estimates.

These posterior distributions look pretty similar now! Just as an illustration, I’ll show what happens when I update the initial priors with all of the data at once.

As the formulas predict, the posteriors after one big batch of data are identical to those obtained by repeatedly adding multiple smaller batches of data. It’s also a little easier to see the discrepancies between the final posteriors in this illustration because the likelihood curve acts as a visual anchor. The uniform and Jeffreys’s priors result in posteriors that essentially fall right on top of the likelihood, whereas the informed prior results in a posterior that is very slightly shifted to the left of the likelihood.

My takeaway from these posteriors is that Cassandra Brown has a pretty damn good 3-point shot! In a future post I’ll explain how to use this method of updating to make inferences using Bayes factors. It’s called the Savage-Dickey density method, and I think it’s incredibly intuitive and easy to use.

Notes:

Footnote 1: I’m making a major assumption about the data: Any one shot is exchangeable with any other shot. This might not be defensible since the final ball on each rack is worth a bonus point, so maybe those shots differ systematically from regular shots, but it’s a toy example so I’ll ignore that possibility. There’s also the possibility of her going on a hot streak, a.k.a. having a “hot hand”, but I’m going to ignore that too because I’m the one writing this blog post and I want to keep it simple. There’s also the possibility that she gets worse throughout the competition because she gets tired, but then there’s also the possibility that she gets better as she warms up with multiple rounds. All of these things are reasonable to consider and I am going to ignore them all.

Footnote 2: There is a tendency to call any priors that have very little impact on the posterior “non-informative”, but, as I mentioned in the section on determining priors, uniform priors that seem non-informative in one context can become highly informative with parameter transformation (Zhu & Lu, 2004). Jeffreys’s prior was derived precisely with that in mind, so it carries little information no matter what transformation is applied.

References:

Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge University Press.

Lee, M. D., & Wagenmakers, E. J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112(3), 662-668.

Raiffa, H. & Schlaifer, R. (1961). Applied statistical decision theory. Division of Research, Graduate School of Business Administration, Harvard University.

Zhu, M., & Lu, A. Y. (2004). The counter-intuitive non-informative prior for the Bernoulli family. Journal of Statistics Education, 12(2), 1-10.

Type-S and Type-M errors

An anonymous reader of the blog emailed me:
–
I wonder if you’d be ok to help me to understanding this Gelman’s I struggle to understand what is the plotted distribution and the exact meaning of the red area. Of course I read the related article, but it doesn’t help me much.
Rather than write a long-winded email, I figured it will be easier to explain on the blog using some step by step illustrations. With the anonymous reader’s permission I am sharing the question and this explanation for all to read. The graph in question is reproduced below. I will walk through my explanation by building up to this plot piecewise with the information we have about the specific situation referenced in the related paper. The paper, written by Andrew Gelman and John Carlin, illustrates the concepts of Type-M errors and Type-S errors. From the paper:
We frame our calculations not in terms of Type 1 and Type 2 errors but rather Type S (sign) and Type M (magnitude) errors, which relate to the probability that claims with confidence have the wrong sign or are far in magnitude from underlying effect sizes (p. 2)
So Gelman’s graph is an attempt to illustrate these types of errors. I won’t go into the details of the paper since you can read it yourself! I was asked to explain this graph though, which isn’t in the paper, so we’ll go through step by step building our own type-s/m graph in order to build an understanding. The key idea is this: if the underlying true population mean is small and sampling error is large, then experiments that achieve statistical significance must have exaggerated effect sizes and are likely to have the wrong sign. The graph in question:
A few technical details: Here Gelman is plotting a sampling distribution for a hypothetical experiment. If one were to repeatedly take a sample from a population, then each sample mean would be different from the true population mean by some amount due to random variation. When we run an experiment, we essentially pick a sample mean from this distribution at random. Picking at random, sample means tend to be near the true mean of the population, and the how much these random sample means vary follows a curve like this. The height of the curve represents the relative frequency for a sample mean in a series of random picks. Obtaining sample means far away from the true mean is relatively rare since the height of the curve is much lower the farther out we go from the population mean. The red shaded areas indicate values of sample means that achieve statistical significance (i.e., exceed some critical value).
–
The distribution’s form is determined by two parameters: a location parameter and a scale parameter. The location parameter is simply the mean of the distribution (μ), and the scale parameter is the standard deviation of the distribution (σ). In this graph, Gelman defines the true population mean to be 2 based on his experience in this research area; the standard deviation is equal to the sampling error (standard error) of our procedure, which in this case is approximately 8.1 (estimated from empirical data; for more information see the paper, p. 6). The extent of variation in sample means is determined by the amount of sampling error present in our experiment. If measurements are noisy, or if the sample is small, or both, then sampling error goes up. This is reflected in a wider sampling distribution. If we can refine our measurements, or increase our sample size, then sampling error goes down and we see a narrower sampling distribution (smaller value of σ).

Let’s build our own Type-S and Type-M graph

In Gelman’s graph the mean of the population is 2, and this is indicated by the vertical blue line at the peak of the curve. Again, this hypothetical true value is determined by Gelman’s experience with the topic area. The null hypothesis states that the true mean of the population is zero, and this is indicated by the red vertical line. The hypothetical sample mean from Gelman’s paper is 17, which I’ve added as a small grey diamond near the x-axis. R code to make all figures is provided at the end of this post (except the gif).
If we assume that the true population mean is actually zero (indicated by the red vertical line), instead of 2, then the sampling distribution has a location parameter of 0 and a scale parameter of 8.1. This distribution is shown below. The diamond representing our sample mean corresponds to a fairly low height on the curve, indicating that it is relatively rare to obtain such a result under this sampling distribution.
Next we need to define cutoffs for statistically significant effects (the red shaded areas under the curve in Gelman’s plot) using the null value combined with the sampling error of our procedure. Since this is a two-sided test using an alpha of 5%, we have one cutoff for significance at approximately -15.9 (i.e., 0 – [1.96 x 8.1]) and the other cutoff at approximately 15.9 (i.e., 0 + [1.96 x 8.1]). Under the null sampling distribution, the shaded areas are symmetrical. If we obtain a sample mean that lies beyond these cutoffs we declare our result statistically significant by conventional standards. As you can see, the diamond representing our sample mean of 17 is just beyond this cutoff and thus achieves statistical significance.
But Gelman’s graph assumes the population mean is actually 2, not zero. This is important because we can’t actually have a sign error or a magnitude error if there isn’t a true sign or magnitude. We can adjust the curve so that the peak is above 2 by shifting it over slightly to the right. The shaded areas begin in the same place on the x-axis as before (+/- 15.9), but notice that they have become asymmetrical. This is due to the fact that we shifted the entire distribution slightly to the right, shrinking the left shaded area and expanding the right shaded area.
And there we have our own beautiful type-s and type-m graph. Since the true population mean is small and positive, any sample mean falling in the left tail has the wrong sign and vastly overestimates the population mean (-15.9 vs. 2). Any sample mean falling in the right tail has the correct sign, but again vastly overestimates the population mean (15.9 vs. 2). Our sample mean falls squarely in the right shaded tail. Since the standard error of this procedure (8.1) is much larger than the true population mean (2), any statistically significant result must have a sample mean that is much larger in magnitude than the true population mean, and is quite likely to have the wrong sign.
In this case the left tail contains 24% of the total shaded area under the curve, so in repeated sampling a full 24% of significant results will be in the wrong tail (and thus be a sign error). If the true population mean were still positive but larger in magnitude then the shaded area in the left tail would become smaller and smaller, as it did when we shifted the true population mean from zero to 2, and thus sign errors would be less of a problem. As Gelman and Carlin summarize,
setting the true effect size to 2% and the standard error of measurement to 8.1%, the power comes out to 0.06, the Type S error probability is 24%, and the expected exaggeration factor is 9.7. Thus, it is quite likely that a study designed in this way would lead to an estimate that is in the wrong direction, and if “significant,” it is likely to be a huge overestimate of the pattern in the population. (p. 6)
Here is a neat gif showing our progression! Thanks for reading 🙂

(I don’t think this disclaimer is needed but here it goes: I don’t think people should actually use repeated-sampling statistical inference. This is simply an explanation of the concept. Be a Bayesian!)

Understanding Bayes: A Look at the Likelihood

[This post has been updated and turned into a paper to be published in AMPPS]

Much of the discussion in psychology surrounding Bayesian inference focuses on priors. Should we embrace priors, or should we be skeptical? When are Bayesian methods sensitive to specification of the prior, and when do the data effectively overwhelm it? Should we use context specific prior distributions or should we use general defaults? These are all great questions and great discussions to be having.

One thing that often gets left out of the discussion is the importance of the likelihood. The likelihood is the workhorse of Bayesian inference. In order to understand Bayesian parameter estimation you need to understand the likelihood. In order to understand Bayesian model comparison (Bayes factors) you need to understand the likelihood and likelihood ratios.

What is likelihood?

Likelihood is a funny concept. It’s not a probability, but it is proportional to a probability. The likelihood of a hypothesis (H) given some data (D) is proportional to the probability of obtaining D given that H is true, multiplied by an arbitrary positive constant (K). In other words, L(H|D) = K · P(D|H). Since a likelihood isn’t actually a probability it doesn’t obey various rules of probability. For example, likelihood need not sum to 1.

A critical difference between probability and likelihood is in the interpretation of what is fixed and what can vary. In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a hypothesis, L(H|D), conditions on the data as if they are fixed while allowing the hypotheses to vary.

The distinction is subtle, so I’ll say it again. For conditional probability, the hypothesis is treated as a given and the data are free to vary. For likelihood, the data are a given and the hypotheses vary.

The Likelihood Axiom

Edwards (1992, p. 30) defines the Likelihood Axiom as a natural combination of the Law of Likelihood and the Likelihood Principle.

The Law of Likelihood states that “within the framework of a statistical model, a particular set of data supports one statistical hypothesis better than another if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis” (Emphasis original. Edwards, 1992, p. 30).

In other words, there is evidence for H1 vis-a-vis H2 if and only if the probability of the data under H1 is greater than the probability of the data under H2. That is, D is evidence for H1 over H2 if P(D|H1) >  P(D|H2). If these two probabilities are equivalent, then there is no evidence for either hypothesis over the other. Furthermore, the strength of the statistical evidence for H1 over H2 is quantified by the ratio of their likelihoods, L(H1|D)/L(H2|D) (which again is proportional to P(D|H1)/P(D|H2) up to an arbitrary constant that cancels out).

The Likelihood Principle states that the likelihood function contains all of the information relevant to the evaluation of statistical evidence. Other facets of the data that do not factor into the likelihood function are irrelevant to the evaluation of the strength of the statistical evidence (Edwards, 1992, p. 30; Royall, 1997, p. 22). They can be meaningful for planning studies or for decision analysis, but they are separate from the strength of the statistical evidence.

Likelihoods are meaningless in isolation

Unlike a probability, a likelihood has no real meaning per se due to the arbitrary constant. Only by comparing likelihoods do they become interpretable, because the constant in each likelihood cancels the other one out. The easiest way to explain this aspect of likelihood is to use the binomial distribution as an example.

Suppose I flip a coin 10 times and it comes up 6 heads and 4 tails. If the coin were fair, p(heads) = .5, the probability of this occurrence is defined by the binomial distribution:

$\ P \big(X = x \big) = \binom{n}{x} p^x \big(1-p \big)^{n-x}$

where x is the number of heads obtained, n is the total number of flips, p is the probability of heads, and

$\binom{n}{x} = \frac{n!}{x! (n-x)!}$

Substituting in our values we get

$\ P \big(X = 6 \big) = \frac{10!}{6! (4!)} \big(.5 \big)^6 \big(1-.5 \big)^{4} \approx .21$

If the coin were a trick coin, so that p(heads) = .75, the probability of 6 heads in 10 tosses is:

$\ P \big(X = 6 \big) = \frac{10!}{6! (4!)} \big(.75 \big)^6 \big(1-.75 \big)^{4} \approx .15$

To quantify the statistical evidence for the first hypothesis against the second, we simply divide one probability by the other. This ratio tells us everything we need to know about the support the data lends to one hypothesis vis-a-vis the other.  In the case of 6 heads in 10 tosses, the likelihood ratio (LR) for a fair coin vs our trick coin is:

$LR = \Bigg(\frac{10!}{6! (4!)} \big(.5 \big)^6 \big(1-.5 \big)^4 \Bigg) \div \Bigg(\frac{10!}{6! (4!)} \big(.75 \big)^6 \big(1-.75 \big)^4 \Bigg) \approx .21/.15 \approx 1.4$

Translation: The data are 1.4 times as probable under a fair coin hypothesis than under this particular trick coin hypothesis. Notice how the first terms in each of the equations above, i.e., $\frac{10!}{6! (4!)}$, are equivalent and completely cancel each other out in the likelihood ratio.

Same data. Same constant. Cancel out.

The first term in the equations above, $\frac{10!}{6! (4!)}$, details our journey to obtaining 6 heads out of 10. If we change our journey (i.e., different sampling plan) then this changes the term’s value, but crucially, since it is the same term in both the numerator and denominator it always cancels itself out. In other words, the information contained in the way the data are obtained disappears from the function. Hence the irrelevance of the stopping rule to the evaluation of statistical evidence, which is something that makes bayesian and likelihood methods valuable and flexible.

If we leave out the first term in the above calculations, our numerator is L(.5) = 0.0009765625 and our denominator is L(.75) ≈ 0.0006952286. Using these values to form the likelihood ratio we get: 0.0009765625/0.0006952286 ≈ 1.4, as we should since the other terms simply cancelled out before.

Again I want to reiterate that the value of a single likelihood is meaningless in isolation; only in comparing likelihoods do we find meaning.

Looking at likelihoods

Likelihoods may seem overly restrictive at first. We can only compare 2 simple statistical hypotheses in a single likelihood ratio. But what if we are interested in comparing many more hypotheses at once? What if we want to compare all possible hypotheses at once?

In that case we can plot the likelihood function for our data, and this lets us ‘see’ the evidence in its entirety. By plotting the entire likelihood function we compare all possible hypotheses simultaneously. The Likelihood Principle tells us that the likelihood function encompasses all statistical evidence that our data can provide, so we should always plot this function along side our reported likelihood ratios.

Following the wisdom of Birnbaum (1962), “the “evidential meaning” of experimental results is characterized fully by the likelihood function” (as cited in Royall, 1997, p.25). So let’s look at some examples. The R script at the end of this post can be used to reproduce these plots, or you can use it to make your own plots. Play around with it and see how the functions change for different number of heads, total flips, and hypotheses of interest. See the instructions in the script for details.

Below is the likelihood function for 6 heads in 10 tosses. I’ve marked our two hypotheses from before on the likelihood curve with blue dots. Since the likelihood function is meaningful only up to an arbitrary constant, the graph is scaled by convention so that the best supported value (i.e., the maximum) corresponds to a likelihood of 1.

The vertical dotted line marks the hypothesis best supported by the data. The likelihood ratio of any two hypotheses is simply the ratio of their heights on this curve. We can see from the plot that the fair coin has a higher likelihood than our trick coin.

How does the curve change if instead of 6 heads out of 10 tosses, we tossed 100 times and obtained 60 heads?

Our curve gets much narrower! How did the strength of evidence change for the fair coin vs the trick coin? The new likelihood ratio is L(.5)/L(.75) ≈ 29.9. Much stronger evidence!(footnote) However, due to the narrowing, neither of these hypothesized values are very high up on the curve anymore. It might be more informative to compare each of our hypotheses against the best supported hypothesis. This gives us two likelihood ratios: L(.6)/L(.5) ≈ 7.5 and L(.6)/L(.75) ≈ 224.

Here is one more curve, for when we obtain 300 heads in 500 coin flips.

Notice that both of our hypotheses look to be very near the minimum of the graph. Yet their likelihood ratio is much stronger than before. For this data the likelihood ratio L(.5)/L(.75) is nearly 24 million! The inherent relativity of evidence is made clear here: The fair coin was supported when compared to one particular trick coin. But this should not be interpreted as absolute evidence for the fair coin, because the likelihood ratio for the maximally supported hypothesis vs the fair coin, L(.6)/L(.5), is nearly 24 thousand!

We need to be careful not to make blanket statements about absolute support, such as claiming that the maximum is “strongly supported by the data”. Always ask, “Compared to what?” The best supported hypothesis will be only be weakly supported vs any hypothesis just before or just after it on the x-axis. For example, L(.6)/L(.61) ≈ 1.1, which is barely any support one way or the other. It cannot be said enough that evidence for a hypothesis must be evaluated in consideration with a specific alternative.

Connecting likelihood ratios to Bayes factors

Bayes factors are simple extensions of likelihood ratios. A Bayes factor is a weighted average likelihood ratio based on the prior distribution specified for the hypotheses. (When the hypotheses are simple point hypotheses, the Bayes factor is equivalent to the likelihood ratio.) The likelihood ratio is evaluated at each point of the prior distribution and weighted by the probability we assign that value. If the prior distribution assigns the majority of its probability to values far away from the observed data, then the average likelihood for that hypothesis is lower than one that assigns probability closer to the observed data. In other words, you get a Bayes boost if you make more accurate predictions. Bayes factors are extremely valuable, and in a future post I will tackle the hard problem of assigning priors and evaluating weighted likelihoods.

I hope you come away from this post with a greater knowledge of, and appreciation for, likelihoods. Play around with the R code and you can get a feel for how the likelihood functions change for different data and different hypotheses of interest.

(footnote) Obtaining 60 heads in 100 tosses is equivalent to obtaining 6 heads in 10 tosses 10 separate times. To obtain this new likelihood ratio we can simply multiply our ratios together. That is, raise the first ratio to the power of 10; 1.4^10 ≈ 28.9, which is just slightly off from the correct value of 29.9 due to rounding.

References

Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57(298), 269-306.

Edwards, A. W. (1992). Likelihood, expanded ed. Johns Hopkins University Press.

Royall, R. (1997). Statistical evidence: A likelihood paradigm (Vol. 71). CRC press.

Edwards, Lindman, and Savage (1963) on why the p-value is still so dominant

Below is an excerpt from Edwards, Lindman, and Savage (1963, pp. 236-7), on why p-value procedures continue to be dominant in the empirical sciences even after it has been repeatedly shown to be an incoherent and nonsensical statistic (note: those are my choice of words, the authors are very cordial in their commentary). The age of the article shows in numbers 1 and 2, but I think it is still valuable commentary; Numbers 3 and 4 are still highly relevant today.

From Edwards, Lindman, and Savage (1963, pp. 236-7):

If classical significance tests have rather frequently rejected true null hypotheses without real evidence, why have they survived so long and so dominated certain empirical sciences ? Four remarks seem to shed some light on this important and difficult question.

1. In principle, many of the rejections at the .05 level are based on values of the test statistic far beyond the borderline, and so correspond to almost unequivocal evidence [i.e., passing the interocular trauma test]. In practice, this argument loses much of its force. It has become customary to reject a null hypothesis at the highest significance level among the magic values, .05, .01, and .001, which the test statistic permits, rather than to choose a significance level in advance and reject all hypotheses whose test statistics fall beyond the criterion value specified by the chosen significance level. So a .05 level rejection today usually means that the test statistic was significant at the .05 level but not at the .01 level. Still, a test statistic which falls just short of the .01 level may correspond to much stronger evidence against a null hypothesis than one barely significant at the .05 level. …

2. Important rejections at the .05 or .01 levels based on test statistics which would not have been significant at higher levels are not common. Psychologists tend to run relatively large experiments, and to get very highly significant main effects. The place where .05 level rejections are most common is in testing interactions in analyses of variance—and few experimenters take those tests very seriously, unless several lines of evidence point to the same conclusions. [emphasis added]

3. Attempts to replicate a result are rather rare, so few null hypothesis rejections are subjected to an empirical check. When such a check is performed and fails, explanation of the anomaly almost always centers on experimental design, minor variations in technique, and so forth, rather than on the meaning of the statistical procedures used in the original study.

4. Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data […] Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. […]The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses. [emphasis added]

References

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological review, 70(3), 193-242.