# Sunday Bayes: Testing precise hypotheses

First and foremost, when testing precise hypotheses, formal use of P-values should be abandoned. Almost anything will give a better indication of the evidence provided by the data against Ho.

–Berger & Delampady, 1987 (pdf link)

### Sunday Bayes series intro:

After the great response to the eight easy steps paper we posted, I started a recurring series, where each week I highlight one of the papers that we included in the appendix of the paper. The format is short and simple: I will give a quick summary of the paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere. At the end of the post I will list a few suggestions for the next entry, so vote in the comments or on twitter (@alxetz) for which one you’d like next. This paper was voted to be the next in the series.

(I changed the series name to Sunday Bayes, since I’ll be posting these on every Sunday.)

### Testing precise hypotheses

This would indicate that say, claiming that a P-value of .05 is significant evidence against a precise hypothesis is sheer folly; the actual Bayes factor may well be near 1, and the posterior probability of Ho near 1/2 (p. 326)

Berger and Delampady (pdf link) review the background and standard practice for testing point null hypotheses (i.e., “precise hypotheses”). The paper came out nearly 30 years ago, so some parts of the discussion may not be as relevant these days, but it’s still a good paper.

They start by reviewing the basic measures of evidence — p-values, Bayes factors, posterior probabilities — before turning to an example. Rereading it, I remember why we gave this paper one of the highest difficulty ratings in the eight steps paper. There is a lot of technical discussion in this paper, but luckily I think most of the technical bits can be skipped in lieu of reading their commentary.

One of the main points of this paper is to investigate precisely when it is appropriate to approximate a small interval null hypothesis by using a point null hypothesis. They conclude that, most of the time, the error of approximation for Bayes factors will be small (<10%):

these numbers suggest that the point null approximation to Ho will be reasonable so long as [the width of the null interval] is one-half a [standard error] in width or smaller. (p. 322)

A secondary point of this paper is to refute the claim that classical answers will typically agree with some “objective” Bayesian analyses. Their conclusion is that such a claim

is simply not the case in the testing of precise hypotheses. This is indicated in Table 1 where, for instance, P(Ho | x) [NB: the posterior probability of the null] is from 5 to 50 times larger than the P-value. (p. 318)

They also review some lower bounds on the amount of Bayesian evidence that corresponds to significant p-values. They sum up their results thusly,

The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)

and

the weighted likelihood of H1 is at most [2.5] times that of Ho. A likelihood ratio [NB: Bayes factor] of [2.5] is not particularly strong evidence, particularly when it is [an upper] bound. However, it is customary in practice to view [p] = .05 as strong evidence against Ho. A P-value of [p] = .01, often considered very strong evidence against Ho, corresponds to [BF] = .1227, indicating that H1 is at most 8 times as likely as Ho. The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)
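To see where numbers like these come from, here is a minimal sketch (my own illustration, not the paper's calculation). It uses the simplest of these lower bounds, obtained by letting the alternative concentrate at the maximum-likelihood value, which gives BF01 ≥ exp(−z²/2) for a z-statistic; Berger and Delampady's bounds over symmetric priors are even less favorable to H1:

```python
from math import exp
from statistics import NormalDist

def min_bf01(p):
    """Smallest possible Bayes factor in favor of H0 for a two-sided
    z-test p-value, letting the alternative sit at the MLE: exp(-z^2/2)."""
    z = NormalDist().inv_cdf(1 - p / 2)  # two-sided critical value
    return exp(-z * z / 2)

for p in (0.05, 0.01):
    b = min_bf01(p)
    print(f"p = {p}: BF01 >= {b:.3f}, H1 at most {1 / b:.1f}x as likely")
```

Even this most generous bound gives H1 at most about 7 times the support of H0 at p = .05; restricting the alternatives to symmetric priors, as Berger and Delampady do, shrinks that to the 2.5 and 8 figures quoted above.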

### A few choice quotes

Page 319:

[A common opinion is that if] θ0 [NB: a point null] is not in [a confidence interval] it can be rejected, and looking at the set will provide a good indication as to the actual magnitude of the difference between θ and θ0. This opinion is wrong, because it ignores the supposed special nature of θ0. A point can be outside a 95% confidence set, yet not be so strongly contraindicated by the data. Only by calculating a Bayes factor … can one judge how well the data supports a distinguished point θ0.

Page 327:

Of course, every statistician must judge for himself or herself how often precise hypotheses actually occur in practice. At the very least, however, we would argue that all types of tests should be able to be properly analyzed by statistics

Page 327 (emphasis original, since that text is a subheading):

[It is commonly argued that] The P-Value Is Just a Data Summary, Which We Can Learn To Properly Calibrate … One can argue that, through experience, one can learn how to interpret P-values. … But if the interpretation depends on Ho, the sample size, the density and the stopping rule, all in crucial ways, it becomes ridiculous to argue that we can intuitively learn to properly calibrate P-values.

Page 328:

we would urge reporting both the Bayes factor, B, against [H0] and a confidence or credible region, C. The Bayes factor communicates the evidence in the data against [H0], and C indicates the magnitude of the possible discrepancy.

Page 328:

Without explicit alternatives, however, no Bayes factor or posterior probability could be calculated. Thus, the argument goes, one has no recourse but to use the P-value. A number of Bayesian responses to this argument have been raised … here we concentrate on responding in terms of the discussion in this paper. If, indeed, it is the case that P-values for precise hypotheses essentially always drastically overstate the actual evidence against Ho when the alternatives are known, how can one argue that no problem exists when the alternatives are not known?

### Vote for the next entry:

1. Edwards, Lindman, and Savage (1963) — Bayesian Statistical Inference for Psychological Research (pdf)
2. Rouder (2014) — Optional Stopping: No Problem for Bayesians (pdf)
3. Gallistel (2009) — The Importance of Proving the Null (pdf)
4. Lindley (2000) — The philosophy of statistics (pdf)

# A Bayesian perspective on the Reproducibility Project: Psychology

It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude.

–Harold Jeffreys, 1939

Joachim Vandekerckhove (@VandekerckhoveJ) and I have just published a Bayesian reanalysis of the Reproducibility Project: Psychology in PLOS ONE. It is open access, so everyone can read it! Boo paywalls! Yay open access! The review process at PLOS ONE was very nice; we had two rounds of reviews that really helped us clarify our explanations of the method and results.

Oh, and it got a new title: “A Bayesian perspective on the Reproducibility Project: Psychology.” A little less presumptuous than the old blog post’s title. Because the RPP authors shared all of their data, we research parasites were able to find some interesting stuff. (And thanks to Richard Morey (@richarddmorey) for making this great badge.)

TLDR: One of the main takeaways from the paper is the following: We shouldn’t be too surprised when psychology experiments don’t replicate, given that the evidence in the original studies is often unacceptably weak to begin with!

### What did we do?

Here is the abstract from the paper:

We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors—a quantity that can be used to express comparative evidence for an hypothesis but also for the null hypothesis—for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.

In the paper we try to answer four questions: 1) How much evidence is there in the original studies? 2) If we account for the possibility of publication bias, how much evidence is left in the original studies? 3) How much evidence is there in the replication studies? 4) How consistent is the evidence between (bias-corrected) original studies and replication studies?

We implement a very neat technique called Bayesian model averaging to account for publication bias in the original studies. The method is fairly technical, so I’ve put the topic in the Understanding Bayes queue (probably the next post in the series). The short version is that each Bayes factor consists of eight likelihood functions that get weighted based on the potential bias in the original result. There are details in the paper, and much more technical detail in this paper (Guan and Vandekerckhove, 2015). Since the replication studies would be published regardless of outcome, and were almost certainly free from publication bias, we can calculate regular (bias free) Bayes factors for them.

### Results

There are only 8 studies where both the bias-mitigated original Bayes factor and the replication Bayes factor are above 10 (highlighted with the blue hexagon). That is, both experiment attempts provide strong evidence. It may go without saying, but I’ll say it anyway: These are the ideal cases.

(The prior distribution for all Bayes factors is a normal distribution with mean of zero and variance of one. All the code is online if you’d like to see how different priors change the result; our sensitivity analysis didn’t reveal any major dependencies on the exact prior used.)
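For intuition about what such a Bayes factor looks like, here is a minimal sketch under the same N(0, 1) prior on the standardized effect size. It computes an ordinary (not bias-mitigated) Bayes factor as a ratio of marginal likelihoods; the function name and the normal-approximation setup are my own illustration, not the paper's code:

```python
from statistics import NormalDist

def bf10(d_hat, n):
    """Bayes factor for H1: delta ~ N(0, 1) against H0: delta = 0,
    given an observed standardized effect d_hat with standard error
    1/sqrt(n). A generic sketch, not the paper's mitigated version."""
    se = n ** -0.5
    marginal_h1 = NormalDist(0, (1 + se ** 2) ** 0.5).pdf(d_hat)
    marginal_h0 = NormalDist(0, se).pdf(d_hat)
    return marginal_h1 / marginal_h0
```

With hypothetical numbers: an observed standardized effect of 0.5 at n = 40 yields a Bayes factor above 10 (strong evidence), while 0.2 at the same n stays inside the weak 1/10 < BF < 10 band.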

The majority of studies (46/72) have both bias-mitigated original and replication Bayes factors in the 1/10 < BF < 10 range (highlighted with the red box). These are cases where both study attempts only yielded weak evidence.

Overall, both attempts for most studies provided only weak evidence. There is a silver/bronze/rusty-metal lining, in that when both study attempts obtain only weak Bayes factors, they are technically providing consistent amounts of evidence. But that’s still bad, because “consistency” just means that we are systematically gathering weak evidence!

Using our analysis, no studies provided strong evidence favoring the null hypothesis in either the original or the replication.

It is interesting to consider the cases where one study attempt found strong evidence but another did not. I’ve highlighted these cases in blue in the table below. What can explain this?

One might be tempted to manufacture reasons that explain this pattern of results, but before you do that take a look at the figure below. We made this figure to highlight one common aspect of all study attempts that find weak evidence in one attempt and strong evidence in another: Differences in sample size. In all cases where the replication found strong evidence and the original study did not, the replication attempt had the larger sample size. Likewise, whenever the original study found strong evidence and the replication did not, the original study had a larger sample size.

Figure 2. Evidence resulting from replicated studies plotted against evidence resulting from the original publications. For the original publications, evidence for the alternative hypothesis was calculated taking into account the possibility of publication bias. Small crosses indicate cases where neither the replication nor the original gave strong evidence. Circles indicate cases where one or the other gave strong evidence, with the size of each circle proportional to the ratio of the replication sample size to the original sample size (a reference circle appears in the lower right). The area labeled ‘replication uninformative’ contains cases where the original provided strong evidence but the replication did not, and the area labeled ‘original uninformative’ contains cases where the reverse was true. Two studies that fell beyond the limits of the figure in the top right area (i.e., that yielded extremely large Bayes factors both times) and two that fell above the top left area (i.e., large Bayes factors in the replication only) are not shown. The effect that relative sample size has on Bayes factor pairs is shown by the systematic size difference of circles going from the bottom right to the top left. All values in this figure can be found in S1 Table.

### Abridged conclusion (read the paper for more! More what? Nuance, of course. Bayesians are known for their nuance…)

Even when taken at face value, the original studies frequently provided only weak evidence when analyzed using Bayes factors (i.e., BF < 10), and as you’d expect this already small amount of evidence shrinks even more when you take into account the possibility of publication bias. This has a few nasty implications. As we say in the paper,

In the likely event that [the original] observed effect sizes were inflated … the sample size recommendations from prospective power analysis will have been underestimates, and thus replication studies will tend to find mostly weak evidence as well.

According to our analysis, in which a whopping 57 out of 72 replications had 1/10 < BF < 10, this appears to have been the case.

We also should be wary of claims about hidden moderators. We put it like this in the paper,

The apparent discrepancy between the original set of results and the outcome of the Reproducibility Project can be adequately explained by the combination of deleterious publication practices and weak standards of evidence, without recourse to hypothetical hidden moderators.

Of course, we are not saying that hidden moderators could not have had an influence on the results of the RPP. The statement is merely that we can explain the results reasonably well without necessarily bringing hidden moderators into the discussion. As Laplace would say: We have no need of that hypothesis.

So to sum up,

From a Bayesian reanalysis of the Reproducibility Project: Psychology, we conclude that one reason many published effects fail to replicate appears to be that the evidence for their existence was unacceptably weak in the first place.

With regard to interpretation of results — I will include the same disclaimer here that we provide in the paper:

It is important to keep in mind, however, that the Bayes factor as a measure of evidence must always be interpreted in the light of the substantive issue at hand: For extraordinary claims, we may reasonably require more evidence, while for certain situations—when data collection is very hard or the stakes are low—we may satisfy ourselves with smaller amounts of evidence. For our purposes, we will only consider Bayes factors of 10 or more as evidential—a value that would take an uninvested reader from equipoise to a 91% confidence level. Note that the Bayes factor represents the evidence from the sample; other readers can take these Bayes factors and combine them with their own personal prior odds to come to their own conclusions.
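The “91% confidence level” arithmetic in that disclaimer is just Bayes’ rule on the odds scale; a two-line sketch:

```python
def posterior_prob_h1(bf, prior_odds=1.0):
    """Convert a Bayes factor for H1 over H0 into a posterior probability,
    via posterior odds = BF * prior odds."""
    posterior_odds = bf * prior_odds
    return posterior_odds / (1 + posterior_odds)

# From equipoise (prior odds 1:1), BF = 10 gives 10/11, about 91%.
print(posterior_prob_h1(10))
```

A reader with different prior odds simply plugs them in: someone at 1:4 against the effect would need the same BF = 10 to reach only 10/14 ≈ 71%.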

All of the results are tabulated in the supplementary materials and the code is on GitHub.

### More disclaimers, code, and differences from the old reanalysis

Disclaimer:

All of the results are tabulated in a table in the supplementary information, and MATLAB code to reproduce the results and figures is provided online. When interpreting these results, we use a Bayes factor threshold of 10 to represent strong evidence. If you would like to see how the results change when using a different threshold, all you have to do is change line 118 of the ‘bbc_main.m’ file to whatever thresholds you prefer.

#######

Important note: The function to calculate the mitigated Bayes factors is a prototype and is not robust to misuse. You should not use it unless you know what you are doing!

#######

A few differences between this paper and an old reanalysis:

A few months back I posted a Bayesian reanalysis of the Reproducibility Project: Psychology, in which I calculated replication Bayes factors for the RPP studies. This analysis took the posterior distribution from the original studies as the prior distribution in the replication studies to calculate the Bayes factor. So in that calculation, the hypotheses being compared are: H_0 “There is no effect” vs. H_A “The effect is close to that found by the original study.” It also did not take into account publication bias.

This is important: The published reanalysis is very different from the one in the first blog post.

Since the posterior distributions from the original studies were usually centered on quite large effects, the replication Bayes factors could fall in a wide range of values. If a replication found a moderately large effect, comparable to the original, then the Bayes factor would very largely favor H_A. If the replication found a small-to-zero effect (or an effect in the opposite direction), the Bayes factor would very largely favor H_0. If the replication found an effect in the middle of the two hypotheses, then the Bayes factor would be closer to 1, meaning the data fit both hypotheses equally badly. This last case happened when the replications found effects in the same direction as the original studies but of smaller magnitude.

These three types of outcomes happened with roughly equal frequency; there were lots of strong replications (big BF favoring H_A), lots of strong failures to replicate (BF favoring H_0), and lots of ambiguous results (BF around 1).
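A hedged normal-approximation sketch of how those three outcome types arise (my own simplified version, with hypothetical numbers, not the old post's exact computation):

```python
from statistics import NormalDist

def replication_bf(d_rep, se_rep, d_orig, se_orig):
    """Replication Bayes factor, normal-approximation sketch: H_A uses the
    original study's posterior, roughly N(d_orig, se_orig^2), as the prior
    on the effect, while H_0 fixes the effect at zero."""
    marginal_ha = NormalDist(d_orig, (se_orig ** 2 + se_rep ** 2) ** 0.5).pdf(d_rep)
    marginal_h0 = NormalDist(0, se_rep).pdf(d_rep)
    return marginal_ha / marginal_h0

se_orig, se_rep = 0.18, 0.13          # hypothetical standard errors
for d_rep in (0.6, 0.3, 0.0):         # comparable, intermediate, null effect
    print(d_rep, replication_bf(d_rep, se_rep, 0.6, se_orig))
```

With an original effect of 0.6, a comparable replication effect strongly favors H_A, a null effect strongly favors H_0, and an intermediate effect leaves the Bayes factor near 1, mirroring the three outcome types above.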

The results in this new reanalysis are not as extreme because the prior distribution for H_A is centered on zero, which means it makes more similar predictions to H_0 than the old priors. Whereas roughly 20% of the studies in the first reanalysis were strongly in favor of H_0 (BF>10), that did not happen a single time in the new reanalysis. This new analysis also includes the possibility of a biased publication process, which can have a large effect on the results.

We use a different prior so we get different results. Hence the Jeffreys quote at the top of the page.

# The next steps: Jerome Cornfield and sequential analysis

This is equivalent to saying that if the application of a principle to given evidence leads to an absurdity then the evidence must be discarded. It is reminiscent of the heavy smoker, who, worried by the literature relating smoking to lung cancer, decided to give up reading.

— Cornfield, 1966 (pdf link)

### The next steps series intro:

After the great response to the eight easy steps paper we posted, I have decided to start a recurring series, where each week I highlight one of the papers that we included in the appendix of the paper. The format will be short and simple: I will give a quick summary of the paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere. At the end of the post I will list a few suggestions for the next entry, so vote in the comments or on twitter (@alxetz) for which one you’d like next.

### Sequential trials, sequential analysis and the likelihood principle

Theoretical focus, low difficulty

Cornfield (1966) begins by posing a question:

Do the conclusions to be drawn from any set of data depend only on the data or do they depend also on the stopping rule which led to the data? (p. 18)

The purpose of his paper is to discuss this question and explore the implications of answering “yes” versus “no.” This paper is a natural followup to entries one and three in the eight easy steps paper.

If you have read the eight easy steps paper (or at least the first and third steps), you’ll know that the answer to the above question for classical statistics is “yes”, while the answer for Bayesian statistics is “no.”

Cornfield introduces a concept he calls the “α-postulate,” which states,

All hypotheses rejected at the same critical level [i.e., p<.05] have equal amounts of evidence against them. (p. 19)

Through a series of examples, Cornfield shows that the α-postulate appears to be false.

Cornfield then introduces a concept called the likelihood principle, which comes up in a few of the eight easy steps entries. The likelihood principle says that the likelihood function contains all of the information relevant to the evaluation of statistical evidence. Other facets of the data that do not factor into the likelihood function are irrelevant to the evaluation of the strength of the statistical evidence.

He goes on to show that adherence to the likelihood principle follows from minimizing a linear combination of the type-I (α) and type-II (β) error rates, in contrast to the Neyman-Pearson procedure, which minimizes the type-II error rate (i.e., maximizes power) for a fixed type-I error rate (usually 5%).

Thus, if instead of minimizing β for a given α, we minimize [their linear combination], we must come to the same conclusion for all sample points which have the same likelihood function, no matter what the design. (p. 21)

### A few choice quotes

Page 19 (emphasis added):

The following example will be recognized by statisticians with consulting experience as a simplified version of a very common situation. An experimenter, having made n observations in the expectation that they would permit the rejection of a particular hypothesis, at some predesignated significance level, say .05, finds that he has not quite attained this critical level. He still believes that the hypothesis is false and asks how many more observations would be required to have reasonable certainty of rejecting the hypothesis if the means observed after n observations are taken as the true values. He also makes it clear that had the original n observations permitted rejection he would simply have published his findings. Under these circumstances it is evident that there is no amount of additional observation, no matter how large, which would permit rejection at the .05 level. If the hypothesis being tested is true, there is a .05 chance of its having been rejected after the first round of observations. To this chance must be added the probability of rejecting after the second round, given failure to reject after the first, and this increases the total chance of erroneous rejection to above .05. In fact … no amount of additional evidence can be collected which would provide evidence against the hypothesis equivalent to rejection at the P =.05 level
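Cornfield's point is easy to verify by simulation. The sketch below (my own, not from the paper) follows his setup: test once at a fixed n, and if the result is not significant collect more observations and test again at the same critical value. The overall chance of erroneously rejecting a true null then exceeds .05:

```python
import random

def two_look_error_rate(reps=20_000, n1=20, n2=20, z_crit=1.96, seed=1):
    """Monte Carlo type-I error rate when a non-significant fixed-n test
    is followed by extra observations and a second test at the same
    critical value. Data are N(0, 1), so the null is true."""
    random.seed(seed)
    rejections = 0
    for _ in range(reps):
        x = [random.gauss(0, 1) for _ in range(n1)]
        z1 = sum(x) / len(x) ** 0.5       # z statistic, known sigma = 1
        if abs(z1) > z_crit:
            rejections += 1
            continue
        x += [random.gauss(0, 1) for _ in range(n2)]
        z2 = sum(x) / len(x) ** 0.5
        if abs(z2) > z_crit:
            rejections += 1
    return rejections / reps

# Typically lands near 0.08, well above the nominal .05.
print(two_look_error_rate())
```

Adding further looks at the data only pushes the error rate higher, which is exactly the inflation Cornfield describes.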

Pages 19-20 (emphasis added):

I realize, of course, that practical people tend to become impatient with counter-examples of this type. Quite properly they regard principles as only approximate guides to practice, and not as prescriptions that must be literally followed even when they lead to absurdities. But if one is unwilling to be guided by the α-postulate in the examples given, why should he be any more willing to accept it when analyzing sequential trials? The biostatistician’s responsibility for providing biomedical scientists with a satisfactory explication of inference cannot, in my opinion, be satisfied by applying certain principles when he agrees with their consequences and by disregarding them when he doesn’t.

Page 22 (emphasis added):

The stopping rule is this: continue observations until a normal mean differs from the hypothesized value by k standard errors, at which point stop. It is certain, using the rule, that one will eventually differ from the hypothesized value by at least k standard errors even when the hypothesis is true. … The Bayesian viewpoint of the example is as follows. If one is seriously concerned about the probability that a stopping rule will certainly result in the rejection of a true hypothesis, it must be because some possibility of the truth of the hypothesis is being entertained. In that case it is appropriate to assign a non-zero prior probability to the hypothesis. If this is done, differing from the hypothesized value by k standard errors will not result in the same posterior probability for the hypothesis for all values of n. In fact for fixed k the posterior probability of the hypothesis monotonically approaches unity as n increases, no matter how small the prior probability assigned, so long as it is non-zero, and how large the k, so long as it is finite. Differing by k standard errors does not therefore necessarily provide any evidence against the hypothesis and disregarding the stopping rule does not lead to an absurd conclusion. The Bayesian viewpoint thus indicates that the hypothesis is certain to be erroneously rejected, not because the stopping rule was disregarded, but because the hypothesis was assigned zero prior probability and that such assignment is inconsistent with concern over the possibility that the hypothesis will certainly be rejected when true.
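The Bayesian side of this example can be computed directly. A sketch, under my own illustrative assumptions (a point null with prior probability 1/2 against the alternative μ ~ N(0, 1), unit-variance observations): when the sample mean is held at exactly k standard errors from the null, the posterior probability of the null climbs toward 1 as n grows, just as the quote says:

```python
from math import exp, sqrt

def posterior_h0(k, n, tau=1.0, prior_h0=0.5):
    """Posterior probability of a point null (prior prob. prior_h0) when
    the sample mean sits exactly k standard errors from the null, with
    the alternative mu ~ N(0, tau^2). Unit-variance observations."""
    se = 1 / sqrt(n)
    xbar = k * se
    like_h0 = exp(-0.5 * (xbar / se) ** 2) / se    # N(0, se^2) density (constant dropped)
    s1 = sqrt(tau ** 2 + se ** 2)
    like_h1 = exp(-0.5 * (xbar / s1) ** 2) / s1    # marginal N(0, tau^2 + se^2) density
    posterior_odds = (like_h0 / like_h1) * prior_h0 / (1 - prior_h0)
    return posterior_odds / (1 + posterior_odds)

for n in (10, 100, 10_000, 1_000_000):
    print(n, round(posterior_h0(2.0, n), 3))
```

At small n a 2-standard-error result carries some evidence against the null, but at very large n the same k leaves the null all but certain: the Lindley-style behavior Cornfield invokes.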

### Vote for the next entry:

1. Edwards, Lindman, and Savage (1963) — Bayesian Statistical Inference for Psychological Research (pdf)
2. Rouder (2014) — Optional Stopping: No Problem for Bayesians (pdf)
3. Gallistel (2009) — The Importance of Proving the Null (pdf)
4. Berger and Delampady (1987) — Testing Precise Hypotheses (pdf)

# Understanding Bayes: How to become a Bayesian in eight easy steps

### How to become a Bayesian in eight easy steps: An annotated reading list

(TLDR: We wrote an annotated reading list to get you started in learning Bayesian statistics. A published version is available.)

It can be hard to know where to start when you want to learn about Bayesian statistics. I am frequently asked to share my favorite introductory resources to Bayesian statistics, and my go-to answer has been to share a dropbox folder with a bunch of PDFs that aren’t really sorted or cohesive. In some sense I was acting as little more than a glorified Google Scholar search bar.

It seems like there is some tension out there with regard to Bayes, in that many people want to know more about it, but when they pick up, say, Andrew Gelman and colleagues’ Bayesian Data Analysis they get totally overwhelmed. And then they just think, “Screw this esoteric B.S.” and give up because it doesn’t seem like it is worth their time or effort.

I think this happens a lot. Introductory Bayesian texts usually assume a level of training in mathematical statistics that most researchers simply don’t have time (or otherwise don’t need) to learn. There are actually a lot of accessible Bayesian resources out there that don’t require much math stat background at all, but it just so happens that they are not consolidated anywhere so people don’t necessarily know about them.

### Enter the eight step program

Beth Baribault, Peter Edelsbrunner (@peter1328), Fabian Dablander (@fdabl), Quentin Gronau, and I have just finished a new paper that tries to remedy this situation, titled “How to become a Bayesian in eight easy steps: An annotated reading list.” We were invited to submit this paper for a special issue on Bayesian statistics for Psychonomic Bulletin & Review. Each paper in the special issue addresses a specific question we often hear about Bayesian statistics, and ours was the following:

I am a reviewer/editor handling a manuscript that uses Bayesian methods; which articles should I read to get a quick idea of what that means?

So the paper’s goal is not so much to teach readers how to actually perform Bayesian data analysis — there are other papers in the special issue for that — but to facilitate readers in their quest to understand basic Bayesian concepts. We think it will serve as a nice introductory reading list for any interested researcher.

The format of the paper is straightforward. We highlight eight papers that had a big impact on our own understanding of Bayesian statistics, as well as short descriptions of an additional 28 resources in the Further reading appendix. The first four papers are focused on theoretical introductions, and the second four have a slightly more applied focus.

We also give every resource a ranking from 1–9 on two dimensions: Focus (theoretical vs. applied) and Difficulty (easy vs. hard). We tried to provide a wide range of resources, from easy applications (#14: Wagenmakers, Lee, and Morey’s “Bayesian benefits for the pragmatic researcher”) to challenging theoretical discussions (#12: Edwards, Lindman and Savage’s “Bayesian statistical inference for psychological research”) and others in between.

The figure below (Figure A1, available on the last page of the paper) summarizes our rankings:

The emboldened numbers (1–8) are the papers that we’ve commented on in detail, numbers in light text (9–30) are papers we briefly describe in the appendix, and the italicized numbers (31–36) are our recommended introductory books (also listed in the appendix).

This is how we chose to frame the paper,

Overall, the guide is designed such that a researcher might be able to read all eight of the highlighted articles and some supplemental readings within a few days. After readers acquaint themselves with these sources, they should be well-equipped both to interpret existing research and to evaluate new research that relies on Bayesian methods.

### The list

Here’s the list of papers we chose to cover in detail:

1.  Lindley (1993): The analysis of experimental data: The appreciation of tea and wine. PDF.
2. Kruschke (2015, chapter 2): Introduction: Credibility, models, and parameters. Available on the DBDA website.
3. Dienes (2011): Bayesian versus orthodox statistics: Which side are you on? PDF.
4. Rouder, Speckman, Sun, Morey, & Iverson (2009): Bayesian t tests for accepting and rejecting the null hypothesis. PDF.
5. Vandekerckhove, Matzke, & Wagenmakers (2014): Model comparison and the principle of parsimony. PDF.
6. van de Schoot, Kaplan, Denissen, Asendorpf, Neyer, & van Aken (2014): A gentle introduction to Bayesian analysis: Applications to developmental research. PDF.
7. Lee and Vanpaemel (from the same special issue): Determining priors for cognitive models. PDF.
8. Lee (2008): Three case studies in the Bayesian analysis of cognitive models. PDF.

You’ll have to check out the paper to see our commentary and to find out what other articles we included in the Further reading appendix. We provide URLs (web archived when possible; archive.org/web/) to PDFs of the eight main papers (except #2, which is on the DBDA website), and wherever possible for the rest of the resources (some did not have free copies online; see the References).

I thought this was a fun paper to write, and if you think you might want to learn some Bayesian basics I hope you will consider reading it.

Oh, and I should mention that we wrote the whole paper collaboratively on Overleaf.com. It is a great site that makes it easy to get started using LaTeX, and I highly recommend trying it out.

This is the fifth post in the Understanding Bayes series. Until next time,

# Slides: “Bayesian statistical concepts: A gentle introduction”

I recently gave a talk in Bielefeld, Germany with the title “Bayesian statistical concepts: A gentle introduction.” I had a few people ask for the slides so I figured I would post them here. If you are a regular reader of this blog, it should all look pretty familiar. It was a mash-up of a couple of my Understanding Bayes posts, combining “A look at the Likelihood” and the most recent one. The main goal was to give the audience an appreciation for the comparative nature of Bayesian statistical evidence, as well as to demonstrate how evidence in the sample has to be interpreted in the context of the specific problem. I didn’t go into Bayes factors or posterior estimation because I promised that it would be a simple and easy talk about the basic concepts.

I’m very grateful to JP de Ruiter for inviting me out to Bielefeld to give this talk, in part because it was my first talk ever! I think it went well enough, but there are a lot of things I can improve on; both in terms of slide content and verbal presentation. JP is very generous with his compliments, and he also gave me a lot of good pointers to incorporate for the next time I talk Bayes.

The main narrative of my talk was that we were to draw candies from one of two possible bags and try to figure out which bag we were drawing from. After each of the slides where I proposed the game I had a member of the audience actually come up and play it with me. The candies, bags, and cards were real but the bets were hypothetical. It was a lot of fun. 🙂

Here is a picture JP took during the talk.

Here are the slides. (You can download a pdf copy from here.)

# Understanding Bayes: Evidence vs. Conclusions

In this installment of Understanding Bayes I want to discuss the nature of Bayesian evidence and conclusions. In a previous post I focused on Bayes factors’ mathematical structure and visualization. In this post I hope to give some idea of how Bayes factors should be interpreted in context. How do we use the Bayes factor to come to a conclusion?

### How to calculate a Bayes factor

I’m going to start with an example to show the nature of the Bayes factor. Imagine I have 2 baskets with black and white balls in them. In basket A there are 5 white balls and 5 black balls. In basket B there are 10 white balls. Other than the color, the balls are completely indistinguishable. Here’s my advanced high-tech figure depicting the problem.

You choose a basket and bring it to me. The baskets aren’t labeled so I can’t tell by their appearance which one you brought. You tell me that in order to figure out which basket I have, I am allowed to take a ball out one at a time and then return it and reshuffle the balls around. What outcomes are possible here? In this case it’s super simple: I can either draw a white ball or a black ball.

If I draw a black ball I immediately know I have basket A, since drawing a black ball is impossible with basket B. If I draw a white ball I can’t rule anything out, but drawing a white ball counts as evidence for basket B over basket A. Since the white ball occurs with probability 1 if I have basket B, and probability .5 if I have basket A, then by what is known as the law of likelihood I have evidence for basket B over basket A by a factor of 2. See this post for a refresher on likelihoods, including concepts such as the law of likelihood and the likelihood principle. The short version is that observations count as evidence for basket B over basket A if they are more probable given basket B than given basket A.

I continue to sample, and end up with this set of observations: {W, W, W, W, W, W}. Each white ball that I draw counts as evidence of 2 for basket B over basket A, so my evidence looks like this: {2, 2, 2, 2, 2, 2}. Multiply them all together and my total evidence for B over A is 2^6, or 64. This interpretation is simple: The total accumulated data are, all together, 64 times more probable under basket B than basket A. This number represents a simple Bayes factor, or likelihood ratio.
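This running product of likelihood ratios is easy to sketch in code. Here is a minimal Python illustration of the basket game (the function names are mine, not from the post):

```python
# Probability of drawing a white ball from each basket:
# basket A holds 5 white and 5 black, basket B holds 10 white.
p_white = {"A": 0.5, "B": 1.0}

def likelihood(basket, draw):
    """Probability of a single draw ('W' or 'B') given the basket."""
    return p_white[basket] if draw == "W" else 1 - p_white[basket]

def bayes_factor(draws):
    """Likelihood ratio for basket B over basket A, multiplied across draws."""
    bf = 1.0
    for d in draws:
        bf *= likelihood("B", d) / likelihood("A", d)
    return bf

print(bayes_factor(["W"] * 6))  # 64.0: the data favor basket B by a factor of 64
```

A single black ball drives the ratio straight to 0, reflecting that basket B cannot produce black balls at all.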

### How to interpret a Bayes factor

In one sense, the Bayes factor always has the same interpretation in every problem: It is a ratio formed by the probability of the data under each hypothesis. It’s all about prediction. The bigger the Bayes factor the more one hypothesis outpredicted the other.

But in another sense the interpretation, and our reaction, necessarily depends on the context of the problem, and that is represented by another piece of the Bayesian machinery: The prior odds. The Bayes factor is the factor by which the data shift the balance of evidence from one hypothesis to another, and thus the amount by which the prior odds shift to posterior odds.

Imagine that before you brought me one of the baskets you told me you would draw a card from a standard, shuffled deck of cards. You have a rule: Bring me basket B if the card drawn is a black suit and bring basket A if it is a red suit. You pick a card and, without telling me what it was, bring me a basket. Which basket did you bring me? What information do I have about the basket before I get to draw a sample from it?

I know that there is a 50% chance that you choose a black card, so there is a 50% chance that you bring me basket B. Likewise for basket A. The prior probabilities in this scenario are 50% for each basket, so the prior odds for basket A vs basket B are 1-to-1. (To calculate odds you just divide the probability of one hypothesis by the other.)

Let’s say we draw our sample and get the same results as before: {W, W, W, W, W, W}. The evidence is the same: {2, 2, 2, 2, 2, 2} and the Bayes factor is the same, 2^6=64. What do we conclude from this? Should we conclude we have basket A or basket B?

The conclusion is not represented by the Bayes factor, but by the posterior odds. The Bayes factor is just one piece of the puzzle, namely the evidence contained in our sample. In order to come to a conclusion the Bayes factor has to be combined with the prior odds to obtain posterior odds. We have to take into account the information we had before we started sampling. I repeat: The posterior odds are where the conclusion resides. Not the Bayes factor.

### Posterior odds (or probabilities) and conclusions

In the example just given, the posterior odds happen to equal the Bayes factor. Since the prior odds were 1-to-1, we multiply them by the Bayes factor of 64 to obtain posterior odds of 1-to-64 favoring basket B. This means that, when these are the only two possible baskets, the probability of basket A has shrunk from 50% to 2% and the probability of basket B has grown from 50% to 98%. (To convert odds to probabilities, divide the odds by odds+1.) This is the conclusion, and it necessarily depends on the prior odds we assign.

Say you had a different rule for picking the baskets. Let’s say that this time you draw a card and bring me basket B if you draw a King (of any suit) and you bring me basket A if you draw any other card. Now the prior odds are 48-to-4, or 12-to-1, in favor of basket A.

The data from our sample are the same, {W, W, W, W, W, W}, and so is the Bayes factor, 2^6= 64. The conclusion is qualitatively the same, with posterior odds of 1-to-5.3 that favor basket B. This means that, again when considering these as the only two possible baskets, the probability of basket A has been shrunk from 92% to 16% and the probability of basket B has grown from 8% to 84%. The Bayes factor is the same, but we are less confident in our conclusion. The prior odds heavily favored basket A, so it takes more evidence to overcome this handicap and reach as strong a conclusion as before.

What happens when we change the rule once again: Bring me basket B if you draw a King of Hearts and basket A if you draw any other card. Now the prior odds are 51-to-1 in favor of basket A. The data are the same again, and the Bayes factor is still 64. Now the posterior odds are roughly 1-to-1.3 in favor of basket B. This means that the probability of basket A has been shrunk from 98% to 44% and the probability of basket B has grown from 2% to 56%. The evidence, and the Bayes factor, is exactly the same, but the conclusion is now totally ambiguous.
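The odds arithmetic across these three card rules can be sketched in a few lines of Python (a toy illustration; the helper name is mine):

```python
def posterior_prob_B(prior_odds_A, bayes_factor_B):
    """Posterior probability of basket B, given prior odds of A-to-B
    (e.g. 12 means 12-to-1 favoring A) and a Bayes factor favoring B."""
    posterior_odds_B = bayes_factor_B / prior_odds_A  # posterior odds, B over A
    return posterior_odds_B / (posterior_odds_B + 1)  # odds -> probability

# The three card rules: any card (1-to-1), any King (12-to-1),
# King of Hearts only (51-to-1), each combined with the same Bayes factor of 64
for prior_odds in (1, 12, 51):
    print(prior_odds, round(posterior_prob_B(prior_odds, 64), 2))
```

The same Bayes factor of 64 yields posterior probabilities for basket B of roughly .98, .84, and .56 under the three rules: same evidence, very different conclusions.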

### Evidence vs. Conclusions

In each case I’ve considered, the evidence has been exactly the same: 6 draws, all white. As a corollary to the discussion above, if you try to come to conclusions based only on the Bayes factor then you are implicitly assuming prior odds of 1-to-1. I think this is unreasonable in most circumstances. When someone looks at a medium-to-large Bayes factor in a study claiming “sadness impairs color perception” (or some other ‘cute’ metaphor study published in Psych Science) and thinks, “I don’t buy this,” they are injecting their prior odds into the equation. Their implicit conclusion is: “My posterior odds for this study are not favorable.” This is the conclusion. The Bayes factor is not the conclusion.

Many studies follow-up on earlier work, so we might give favorable prior odds; thus, when we see a Bayes factor of 5 or 10 we “buy what the study is selling,” so to speak. Or the study might be testing something totally new, so we might give unfavorable prior odds; thus, when we see a Bayes factor of 5 or 10 we remain skeptical. This is just another way of saying that we may reasonably require more evidence for extraordinary claims.

### When to stop collecting data

It also follows from the above discussion that sometimes enough is enough. What I mean is that sometimes the conclusion for any reasonable prior odds assignment is strong enough that collecting more data is not worth the time, money, or energy. In the Bayesian framework the stopping rules don’t affect the Bayes factor, and consequently they don’t affect the posterior odds. Take the second example above, where you gave me basket B if you drew any King. I had prior odds of 12-to-1 in favor of basket A, drew 6 white balls in a row, and ended up with 1-to-5.3 posterior odds in favor of basket B. This translated to a posterior probability of 84% for basket B. If I draw 2 more balls and they are both white, my Bayes factor increases to 2^8=256 (and this should not be corrected for multiple comparisons or so-called “topping up”). My posterior odds increase to roughly 1-to-21 in favor of basket B, and the probability for basket B rises from 84% to 96%. I would say that’s enough data for me to make a firm conclusion. But someone else might have other relevant information about the problem I’m studying, and they could come to a different conclusion.

### Conclusions are personal

There’s no reason another observer has to come to the same conclusion as me. She might have talked to you, and you told her that you actually drew three cards (with replacement and reshuffle) and that you would only have brought me basket B if you drew three kings in a row. She has different information than I do, so naturally she has different prior odds. The probability of drawing three kings in a row is (1/13)^3 = 1/2197, so her prior odds are 2196-to-1 in favor of basket A. She would come to a different conclusion than I would, namely that I was probably sampling from basket A after all: her posterior odds are roughly 9-to-1 in favor of basket A. We use the same evidence, a Bayes factor of 2^8=256, but come to different conclusions.

Conclusions are personal. I can’t tell you what to conclude because I don’t know all the information you have access to. But I can tell you what the evidence is, and you can use that to come to your own conclusion. In this post I used a mechanism to generate prior odds that are intuitive and obvious, but we come to our scientific judgments through all sorts of ways that aren’t always easily expressed or quantified. The idea is the same however you come to your prior odds: If you’re skeptical of a study that has a large Bayes factor, then you assigned it strongly unfavorable prior odds.

This is why I, and other Bayesians, advocate for reporting the Bayes factor in experiments. It is not because it tells someone what to conclude from the study, but that it lets them take the information contained in your data to come to their own conclusion. When you report your own Bayes factors for your experiments, in your discussion you might consider how people with different prior odds will react to your evidence. If your Bayes factor is not strong enough to overcome a skeptic’s prior odds, then you may consider collecting more data until it is. If you’re out of resources and the Bayes factor is not strong enough to overcome the prior odds of a moderate skeptic, then there is nothing wrong with acknowledging that other people may reasonably come to different conclusions about your study. Isn’t that how science works?

### Bottom line

If you want to come to a conclusion you need the posterior. If you want to make predictions about future sampling you need the posterior. If you want to make decisions you need the posterior (and a utility function; a topic for a future post). If you try to do all this with only the Bayes factor then you are implicitly assuming equal prior odds, which I maintain are almost never appropriate. (And insofar as you ignore the prior and posterior, do not be surprised when your Bayes factor simulations produce strange results.) In the Bayesian framework each piece has its place. Bayes factors are an important piece of the puzzle, but they are not the only piece. They are simply the most basic piece from my perspective (after the sum and product rules) because they represent the evidence you accumulated in your sample. When you need to do something other than summarize evidence you have to expand your statistical arsenal.

For more introductory material on Bayesian inference, see the Understanding Bayes hub here.

#### Technical caveat

It’s important to remember that everything is relative and conditional in the Bayesian framework. The posterior probabilities I mention in this post are simply the probabilities of the baskets under the assumption that those are the only relevant hypotheses. They are not absolute probabilities. In other words, instead of writing the posterior probability as P(H|D), it should really be written P(H|D,M), where M is the condition that the only hypotheses considered are in the following model index: M = {A, B, …, K}. This is why I personally prefer to use odds notation, since it makes the relativity explicit.

# Understanding Bayes: Visualization of the Bayes Factor

In the first post of the Understanding Bayes series I said:

The likelihood is the workhorse of Bayesian inference. In order to understand Bayesian parameter estimation you need to understand the likelihood. In order to understand Bayesian model comparison (Bayes factors) you need to understand the likelihood and likelihood ratios.

I’ve shown in another post how the likelihood works as the updating factor for turning priors into posteriors for parameter estimation. In this post I’ll explain how using Bayes factors for model comparison can be conceptualized as a simple extension of likelihood ratios.

## There’s that coin again

Imagine we’re in a similar situation as before: I’ve flipped a coin 100 times and it came up 60 heads and 40 tails. The likelihood function for binomial data in general is:

$\ P \big(X = x \big) \propto \ p^x \big(1-p \big)^{n-x}$

and for this particular result:

$\ P \big(X = 60 \big) \propto \ p^{60} \big(1-p \big)^{40}$

The corresponding likelihood curve is shown below, which displays the relative likelihood for all possible simple (point) hypotheses given these data. Any likelihood ratio can be calculated by simply taking the ratio of the different hypotheses’ heights on the curve.

In that previous post I compared the fair coin hypothesis — H0: P(H)=.5 — vs one particular trick coin hypothesis — H1: P(H)=.75. For 60 heads out of 100 tosses, the likelihood ratio for these hypotheses is L(.5)/L(.75) = 29.9. This means the data are 29.9 times as probable under the fair coin hypothesis as under this particular trick coin hypothesis. But often we don’t have theories precise enough to make point predictions about parameters, at least not in psychology. So it’s often helpful if we can assign a range of plausible values for parameters as dictated by our theories.

## Enter the Bayes factor

Calculating a Bayes factor is a simple extension of this process. A Bayes factor is a weighted average likelihood ratio, where the weights are based on the prior distribution specified for the hypotheses. For this example I’ll keep the simple fair coin hypothesis as the null hypothesis — H0: P(H)=.5 — but now the alternative hypothesis will become a composite hypothesis — H1: P(θ). (footnote 1) The likelihood ratio is evaluated at each point of P(θ) and weighted by the relative plausibility we assign that value. Then once we’ve assigned weights to each ratio we just take the average to get the Bayes factor. Figuring out how the weights should be assigned (the prior) is the tricky part.

Imagine my composite hypothesis, P(θ), is a combination of 21 different point hypotheses, all evenly spaced out between 0 and 1 and all of these points are weighted equally (not a very realistic hypothesis!). So we end up with P(θ) = {0, .05, .10, .15, . . ., .9, .95, 1}. The likelihood ratio can be evaluated at every possible point hypothesis relative to H0, and we need to decide how to assign weights. This is easy for this P(θ); we assign zero weight for every likelihood ratio that is not associated with one of the point hypotheses contained in P(θ), and we assign weights of 1 to all likelihood ratios associated with the 21 points in P(θ).

This gif has the 21 point hypotheses of P(θ) represented as blue vertical lines (indicating where we put our weights of 1), and the turquoise tracking lines represent the likelihood ratio being calculated at every possible point relative to H0: P(H)=.5. (Remember, the likelihood ratio is the ratio of the heights on the curve.) This means we only care about the ratios given by the tracking lines when the dot attached to the moving arm aligns with the vertical P(θ) lines. [edit: this paragraph added as clarification]

The 21 likelihood ratios associated with P(θ) are:

{~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, .002, .08, 1, 4.5, 7.5, 4.4, .78, .03, ~0, ~0, ~0, ~0, ~0}

Since they are all weighted equally we simply average, and obtain BF = 18.3/21 = .87. In other words, the data (60 heads out of 100) are 1/.87 = 1.15 times more probable under the null hypothesis — H0: P(H)=.5 — than this particular composite hypothesis — H1: P(θ). Entirely uninformative! Despite tossing the coin 100 times we have extremely weak evidence that is hardly worth even acknowledging. This happened because much of P(θ) falls in areas of extremely low likelihood relative to H0, as evidenced by those 13 zeros above. P(θ) is flexible, since it covers the entire possible range of θ, but this flexibility comes at a price. You have to pay for all of those zeros with a lower weighted average and a smaller Bayes factor.
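For the same setup (60 heads in 100 tosses), this weighted-average calculation can be reproduced in a few lines of Python; the function name is mine, not from the post:

```python
def likelihood_ratio(p, heads=60, n=100, p0=0.5):
    """L(p)/L(p0) for binomial data; the binomial coefficients cancel."""
    return (p**heads * (1 - p)**(n - heads)) / (p0**heads * (1 - p0)**(n - heads))

# 21 equally weighted point hypotheses: {0, .05, .10, ..., .95, 1}
grid = [i * 0.05 for i in range(21)]
bf = sum(likelihood_ratio(p) for p in grid) / 21  # equal weights -> simple average
print(round(bf, 2))  # 0.87: the data very slightly favor H0 over this composite H1
```

Most of the grid points contribute essentially nothing to the sum, which is exactly the penalty for spreading the composite hypothesis over the whole range.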

Now imagine I had seen a trick coin like this before, and I know it had a slight bias towards landing heads. I can use this information to make more pointed predictions. Let’s say I define P(θ) as 21 equally weighted point hypotheses again, but this time they are all equally spaced between .5 and .75, which happens to be the highest density region of the likelihood curve (how fortuitous!). Now P(θ) = {.50, .5125, .525, . . ., .7375, .75}.

The 21 likelihood ratios associated with the new P(θ) are:

{1.00, 1.5, 2.1, 2.8, 4.5, 5.4, 6.2, 6.9, 7.5, 7.3, 6.9, 6.2, 4.4, 3.4, 2.6, 1.8, .78, .47, .27, .14, .03}

They are all still weighted equally, so the simple average is BF = 72/21 = 3.4. Three times more informative than before, and in favor of P(θ) this time! And no zeros. We were able to add theoretically relevant information to H1 to make more accurate predictions, and we get rewarded with a Bayes boost. (But this result is only 3-to-1 evidence, which is still fairly weak.)
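The same averaging with the informed grid shows both the reward and the risk described above (a sketch under the same toy setup; the function name is mine):

```python
def likelihood_ratio(p, heads=60, n=100, p0=0.5):
    """L(p)/L(p0) for binomial data; the binomial coefficients cancel."""
    return (p**heads * (1 - p)**(n - heads)) / (p0**heads * (1 - p0)**(n - heads))

# Informed composite hypothesis: 21 points from .5 to .75, spaced .0125 apart
informed = [0.5 + i * 0.0125 for i in range(21)]
bf_heads = sum(likelihood_ratio(p) for p in informed) / 21  # ~3.4, favoring H1

# Reverse the data (60 tails, i.e. 40 heads) and the same grid is heavily penalized:
# the Bayes factor flips to roughly 10-to-1 against this composite hypothesis
bf_tails = sum(likelihood_ratio(p, heads=40) for p in informed) / 21
```

The more pointed prior earns a larger Bayes factor when the data land where it predicted, and pays for it when they do not.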

This new P(θ) is risky though, because if the data show a bias towards tails or a more extreme bias towards heads then it faces a very heavy penalty (many more zeros). High risk = high reward with the Bayes factor. Make pointed predictions that match the data and get a bump to your BF, but if you’re wrong then pay a steep price. For example, if the data were 60 tails instead of 60 heads the BF would be 10-to-1 against P(θ) rather than 3-to-1 for P(θ)!

Now, typically people don’t actually specify hypotheses like these. Typically they use continuous distributions, but the idea is the same. Take the likelihood ratio at each point relative to H0, weigh according to plausibilities given in P(θ), and then average.

## A more realistic (?) example

Imagine you’re walking down the sidewalk and you see a shiny piece of foreign currency by your feet. You pick it up and want to know if it’s a fair coin or an unfair coin. As a Bayesian you have to be precise about what you mean by fair and unfair. Fair is typically pretty straightforward — H0: P(H)=.5 as before — but unfair could mean anything. Since this is a completely foreign coin to you, you may want to be fairly open-minded about it. After careful deliberation, you assign P(θ) a beta distribution, with shape parameters 10 and 10. That is, H1: P(θ) ~ Beta(10, 10). This means that if the coin isn’t fair, it’s probably close to fair but it could reasonably be moderately biased, and you have no reason to think it is particularly biased to one side or the other.

Now you build a perfect coin-tosser machine and set it to toss 100 times (but not any more than that because you haven’t got all day). You carefully record the results and the coin comes up 33 heads out of 100 tosses. Under which hypothesis are these data more probable, H0 or H1? In other words, which hypothesis did the better job predicting these data?

This may be a continuous prior but the concept is exactly the same as before: weigh the various likelihood ratios based on the prior plausibility assignment and then average. The continuous distribution on P(θ) can be thought of as a set of many, many point hypotheses spaced very close together. So if the range of θ we are interested in is limited to 0 to 1, as with binomials and coin flips, then a distribution containing 101 point hypotheses spaced .01 apart can effectively be treated as if it were continuous. The numbers will be a little off but all in all it’s usually pretty close. So imagine that instead of 21 hypotheses you have 101, and their relative plausibilities follow the shape of a Beta(10, 10). (footnote 2)

Since this is not a uniform distribution, we need to assign varying weights to each likelihood ratio. Each likelihood ratio associated with a point in P(θ) is simply multiplied by the respective density assigned to it under P(θ). For example, the density of P(θ) at .4 is 2.44. So we multiply the likelihood ratio at that point, L(.4)/L(.5) = 128, by 2.44, and add it to the accumulating total likelihood ratio. Do this for every point and then divide by the total number of points, in this case 101, to obtain the approximate Bayes factor. The total weighted likelihood ratio is 5564.9, divide it by 101 to get 55.1, and there’s the Bayes factor. In other words, the data are roughly 55 times more probable under this composite H1 than under H0. The alternative hypothesis H1 did a much better job predicting these data than did the null hypothesis H0.
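Both the 101-point approximation and the exact integral can be reproduced with just the Python standard library (the helper functions are mine; the exact value comes from the beta-binomial marginal likelihood):

```python
from math import lgamma, log, exp

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_pdf(x, a=10, b=10):
    """Density of the Beta(a, b) prior at x."""
    if x <= 0 or x >= 1:
        return 0.0
    return exp((a - 1) * log(x) + (b - 1) * log(1 - x) - log_beta(a, b))

heads, n = 33, 100  # the observed data

def likelihood_ratio(p):
    """L(p)/L(.5) for 33 heads in 100 tosses."""
    if p <= 0 or p >= 1:
        return 0.0
    return exp(heads * log(p) + (n - heads) * log(1 - p) - n * log(0.5))

# 101-point grid: weight each likelihood ratio by the prior density, then average
grid = [i / 100 for i in range(101)]
bf_approx = sum(beta_pdf(p) * likelihood_ratio(p) for p in grid) / 101  # ~55.1

# Exact Bayes factor: integrate the likelihood against the Beta(10, 10) prior,
# which gives the beta-binomial marginal likelihood, divided by L(.5)
bf_exact = exp(log_beta(10 + heads, 10 + n - heads) - log_beta(10, 10)
               - n * log(0.5))  # ~55.7
```

The log-space arithmetic avoids underflow: the raw likelihoods are on the order of 10^-30, but the ratios are perfectly ordinary numbers.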

The actual Bayes factor is obtained by integrating the likelihood with respect to H1’s density distribution and then dividing by the (marginal) likelihood of H0. Essentially what it does is cut P(θ) into infinitely thin slices before calculating the likelihood ratios, re-weighing, and averaging. That Bayes factor comes out to 55.7, which is basically the same answer we got through this rough visual demonstration!

## Take home

The take-home message is hopefully pretty clear at this point: When you are comparing a point null hypothesis with a composite hypothesis, the Bayes factor can be thought of as a weighted average of every point hypothesis’s likelihood ratio against H0, and the weights are determined by the prior density distribution of H1. Since the Bayes factor is a weighted average based on the prior distribution, it’s really important to think hard about the prior distribution you choose for H1. In a previous post, I showed how different priors can converge to the same posterior with enough data. The priors are often said to “wash out” in estimation problems like that. This is not necessarily the case for Bayes factors. The priors you choose matter, so think hard!

## Notes

Footnote 1: A lot of ink has been spilled arguing about how one should define P(θ). I talked about it a little in a previous post.

Footnote 2: I’ve rescaled the likelihood curve to match the scale of the prior density under H1. This doesn’t affect the values of the Bayes factor or likelihood ratios because the scaling constant cancels itself out.