Understanding Bayes: How to cheat to get the maximum Bayes factor for a given p value

OR less click-baity: What is the maximum Bayes factor you can get for a given p value? (Obvious disclaimer: Don’t cheat)

Starting to use and interpret Bayesian statistics can be hard at first. A recent recommendation that I like is from Zoltan Dienes and Neil McLatchie, to “Report a B for every p.” That is, for every p value in the paper, report a corresponding Bayes factor. This way the psychology community can start to build an intuition about how these two kinds of results correspond. I think this is a great way to start using Bayes. And if as time goes on you want to flush those ps down the toilet, I won’t complain.

Researchers who start to report both Bayesian and frequentist results often go through a phase where they are surprised to find that their p<.05 results correspond to weak Bayes factors. In this Understanding Bayes post I hope to pump your intuitions a bit as to why this is the case. There is, in fact, an absolute maximum Bayes factor for a given p value. There are also other soft maximums it can achieve for different classes of prior distributions. And these maximum BFs may not be as high as you expect.

Absolute Maximum

The reason for the absolute maximum is actually straightforward. The Bayes factor compares how accurately two or more competing hypotheses predict the observed data. Usually one of those hypotheses is a point null hypothesis, which says there is no effect in the population (however defined). The alternative can be anything you like. It could be a point hypothesis motivated by theory or that you take from previous literature (uncommon), or it can be a (half-)normal (or other) distribution centered on the null (more common), or anything else. In any case, the fact is that to achieve the absolute maximum Bayes factor for a given p value you have to cheat. In real life you can never reach the absolute maximum in a normal course of analysis so its only use is as a benchmark illustration.

You have to make your alternative hypothesis the exact point hypothesis that maximizes the likelihood of the data. The likelihood function ranks all the parameter values by how well they predict the data, so if you make your point hypothesis equal to the mode of the likelihood function, it means that no other hypothesis or population parameter could make the data more likely. This illicit prior is known as the oracle prior, because it is the prior you would choose if you could see the result ahead of time. So in the figure below, the oracle prior would correspond to the high dot on the curve at the mode, and the null hypothesis is the lower dot on the curve. The Bayes factor is then just the ratio of these heights.

When you are doing a t-test, for example, the maximum of the likelihood function is simply the sample mean. So in this case, the oracle prior is a point hypothesis at exactly the sample mean. Let’s assume that we know the population SD=10, so we’re only interested in the population mean. We collect 100 participants and the sample mean we get is 1.96. Our z score in this case is

z = mean / standard error = 1.96 / (10/√100) = 1.96.

This means we obtain a two-sided p value of .05. Publication and glory await us. But, in sticking with our B for every p mantra, we decide to calculate an oracle Bayes factor just to be complete. This can easily be done in R using the following one line of code:

dnorm(1.96, 1.96, 1)/dnorm(1.96, 0, 1)

And the answer you get is BF = 6.83. This is the absolute maximum Bayes factor you can possibly get for a p value that equals .05 in a t test (you get similar BFs for other types of tests). That is the amount of evidence that would bring a neutral reader who has prior probabilities of 50% for the null and 50% for the alternative to posterior probabilities of 12.8% for the null and 87.2% for the alternative. You might call that moderate evidence depending on the situation. For p of .01, this maximum increases to ~27.5, which is quite strong in most cases. But these values are for the best case ever, where you straight up cheat. When you can’t blatantly cheat the results are not so good.
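
To make the arithmetic above easy to replay for other p values, here is a minimal R sketch. The function names oracle_bf and posterior_h1 are made up for this illustration, and the sketch assumes the same known-SD setup and 50/50 prior odds used in the example.

```r
# Oracle Bayes factor for a two-sided p value, known-SD case.
# oracle_bf and posterior_h1 are hypothetical helper names.
oracle_bf <- function(p) {
  z <- qnorm(1 - p / 2)             # z score that gives this two-sided p
  dnorm(z, z, 1) / dnorm(z, 0, 1)   # likelihood at its mode vs. at the null
}

# Posterior probability of the alternative, given a Bayes factor and prior odds
posterior_h1 <- function(bf, prior_odds = 1) {
  post_odds <- bf * prior_odds
  post_odds / (1 + post_odds)
}

bf_05 <- oracle_bf(0.05)
bf_01 <- oracle_bf(0.01)
print(c(bf_05 = bf_05, bf_01 = bf_01))
print(posterior_h1(bf_05))
```

The ratio simplifies to exp(z^2/2), so the oracle bound depends only on the z score and grows quickly as p shrinks.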

Soft Maximum

Of course, nobody in their right mind would accept your analysis if you used an oracle prior. It is blatant cheating — but it gives a good benchmark. For p of .05 and the oracle prior, the best BF you can ever get is slightly less than 7. If you can’t blatantly cheat by using an oracle prior, the maximum Bayes factor you can get obviously won’t be as high. But it may surprise you how much smaller the maximum becomes if you decide to cheat more subtly.

The priors most people use for the alternative hypothesis in the Bayes factor are not point hypotheses, but distributed hypotheses. A common recommendation is a unimodal (i.e., one-hump) symmetric prior centered on the null hypothesis value. (There are times where you wouldn’t want to use a prior centered on the null value, but in those cases the maximum BF goes back to being the BF you get using an oracle prior.) I usually recommend using normal distribution priors, and JASP software uses a Cauchy distribution which is similar but with fatter tails. Most of the time the BFs you get are very similar.

So imagine that instead of using the blatantly cheating oracle prior, you use a subtle oracle prior. Instead of a point alternative at the observed mean, you use a normal distribution and pick the scale (i.e., the SD) of your prior to maximize the Bayes factor. There is a formula for this, but the derivation is very technical so I’ll let you read Berger and Sellke (1987, especially section 3) if you’re into that sort of torture.

It turns out, once you do the math, that when using a normal distribution prior the maximum Bayes factor you can get for a p value of .05 is BF = 2.1. That is the amount of evidence that would bring a neutral reader who has prior probabilities of 50% for the null and 50% for the alternative to posterior probabilities of 32% for the null and 68% for the alternative. Barely different! That is very weak evidence. The maximum normal prior BF corresponding to p of .01 is BF = 6.5. That is still hardly convincing evidence! You can find this bound for any t value you like (for any t greater than 1) using the R code below:

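t <- 1.96  # observed t (or z) statistic; the bound applies for t > 1
maxBF <- 1/(sqrt(exp(1))*t*exp(-t^2/2))  # maximum BF over normal priors centered on the null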
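
If you would rather check the formula than trust it, one option is to search over prior scales numerically. The sketch below assumes the known-SD setup from earlier, with the prior SD (called tau here, my own label) expressed in standard-error units; bf_normal_prior is a hypothetical helper name.

```r
# BF10 for a z statistic when H1 says mean ~ Normal(0, tau^2), in standard-error units.
# Under H1 the z statistic is marginally Normal(0, 1 + tau^2); under H0 it is Normal(0, 1).
bf_normal_prior <- function(tau, z) {
  dnorm(z, 0, sqrt(1 + tau^2)) / dnorm(z, 0, 1)
}

z <- 1.96

# Search over prior scales for the one that maximizes the Bayes factor
best <- optimize(bf_normal_prior, interval = c(0, 50), z = z, maximum = TRUE)

# Closed-form bound from the text
closed_form <- 1 / (sqrt(exp(1)) * z * exp(-z^2 / 2))

print(c(best_tau = best$maximum, numeric_max = best$objective, closed_form = closed_form))
```

The maximizing scale satisfies 1 + tau^2 = z^2, which has no solution when z is below 1. That is where the "greater than 1" restriction comes from.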

(You can get slightly different maximum values for different formulations of the problem. Another form due to Sellke, Bayarri, & Berger [2001] is 1/[-e*p*ln(p)] for p < 1/e ≈ .37, which for p=.05 returns BF ≈ 2.46.)
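
The Sellke, Bayarri, & Berger form is easy to tabulate for a few p values. This is only a sketch: sbb_bound is a name I made up, and the input check encodes the restriction that the formula applies for p below 1/e.

```r
# Bound of Sellke, Bayarri, & Berger (2001): 1 / (-e * p * log(p)), for p < 1/e.
# sbb_bound is a hypothetical name used only in this sketch.
sbb_bound <- function(p) {
  stopifnot(all(p > 0), all(p < exp(-1)))
  1 / (-exp(1) * p * log(p))
}

p_values <- c(0.05, 0.01, 0.005, 0.001)
print(data.frame(p = p_values, max_bf = round(sbb_bound(p_values), 2)))
```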

You might say, “Wait no I have a directional prediction, so I will use a half-normal prior that allows only positive values for the population mean. What is my maximum BF now?” Luckily the answer is simple: Just multiply the old maximum by:

2*(1 - p/2)

So for p of .05 and .01 the maximum 1-sided BFs are 4.1 and 13, respectively. (By the way, this trick works for converting most common BFs from 2- to 1-sided.)
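
Here is a small sketch of that conversion, assuming the normal-prior bound from above with the t value taken as the z score that matches a two-sided p. Both function names are hypothetical.

```r
# Maximum BF over normal priors centered on the null (two-sided)
max_bf_two_sided <- function(p) {
  z <- qnorm(1 - p / 2)
  1 / (sqrt(exp(1)) * z * exp(-z^2 / 2))
}

# One-sided (half-normal) version: multiply by 2 * (1 - p/2), as in the text
max_bf_one_sided <- function(p) {
  max_bf_two_sided(p) * 2 * (1 - p / 2)
}

p_values <- c(0.05, 0.01)
print(round(max_bf_one_sided(p_values), 1))
```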

Take home message

Do not be surprised if you start reporting Bayes factors and find that what you thought was strong evidence based on a p value of .05 or even .01 translates to a quite weak Bayes factor.

And I think this goes without saying, but don’t try to game your Bayes factors. We’ll know. It’s obvious. The best thing to do is use the prior distribution you find most reasonable for the problem at hand and then do a robustness check by seeing how much the conclusion you draw depends on the specific prior you choose. JASP software can do this for you automatically in many cases (e.g., for the Bayesian t-test; ps check out our official JASP tutorial videos!).

R code

The following is the R code to reproduce the figure, to find the max BF for oracle priors, and to find the max BF for subtle oracle priors. Tinker with it and see how your intuitions match the answers you get!

 

 

A Bayesian perspective on the Reproducibility Project: Psychology

It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude.

–Harold Jeffreys, 1939

Joachim Vandekerckhove (@VandekerckhoveJ) and I have just published a Bayesian reanalysis of the Reproducibility Project: Psychology in PLOS ONE (CLICK HERE). It is open access, so everyone can read it! Boo paywalls! Yay open access! The review process at PLOS ONE was very nice; we had two rounds of reviews that really helped us clarify our explanations of the method and results.

Oh and it got a new title: “A Bayesian perspective on the Reproducibility Project: Psychology.” A little less presumptuous than the old blog’s title. Thanks to the RPP authors sharing all of their data, we research parasites were able to find some interesting stuff. (And thanks Richard Morey (@richarddmorey) for making this great badge)

parasite

TLDR: One of the main takeaways from the paper is the following: We shouldn’t be too surprised when psychology experiments don’t replicate, given the evidence in the original studies is often unacceptably weak to begin with!

What did we do?

Here is the abstract from the paper:

We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors—a quantity that can be used to express comparative evidence for an hypothesis but also for the null hypothesis—for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.

In the paper we try to answer four questions: 1) How much evidence is there in the original studies? 2) If we account for the possibility of publication bias, how much evidence is left in the original studies? 3) How much evidence is there in the replication studies? 4) How consistent is the evidence between (bias-corrected) original studies and replication studies?

We implement a very neat technique called Bayesian model averaging to account for publication bias in the original studies. The method is fairly technical, so I’ve put the topic in the Understanding Bayes queue (probably the next post in the series). The short version is that each Bayes factor consists of eight likelihood functions that get weighted based on the potential bias in the original result. There are details in our paper, and much more technical detail in Guan and Vandekerckhove (2015). Since the replication studies would be published regardless of outcome, and were almost certainly free from publication bias, we can calculate regular (bias-free) Bayes factors for them.

Results

There are only 8 studies where both the bias-mitigated original Bayes factors and the replication Bayes factors are above 10 (highlighted with the blue hexagon). That is, both experiment attempts provide strong evidence. It may go without saying, but I’ll say it anyway: These are the ideal cases.

(The prior distribution for all Bayes factors is a normal distribution with mean of zero and variance of one. All the code is online HERE if you’d like to see how different priors change the result; our sensitivity analysis didn’t reveal any major dependencies on the exact prior used.)

The majority of studies (46/72) have both bias-mitigated original and replication Bayes factors in the 1/10 < BF < 10 range (highlighted with the red box). These are cases where both study attempts only yielded weak evidence.

Table3

Overall, both attempts for most studies provided only weak evidence. There is a silver/bronze/rusty-metal lining, in that when both study attempts obtain only weak Bayes factors, they are technically providing consistent amounts of evidence. But that’s still bad, because “consistency” just means that we are systematically gathering weak evidence!

Using our analysis, no studies provided strong evidence that favored the null hypothesis in either the original or replication.

It is interesting to consider the cases where one study attempt found strong evidence but another did not. I’ve highlighted these cases in blue in the table below. What can explain this?

Table3

One might be tempted to manufacture reasons that explain this pattern of results, but before you do that take a look at the figure below. We made this figure to highlight one common aspect of all study attempts that find weak evidence in one attempt and strong evidence in another: Differences in sample size. In all cases where the replication found strong evidence and the original study did not, the replication attempt had the larger sample size. Likewise, whenever the original study found strong evidence and the replication did not, the original study had a larger sample size.

RPP

Figure 2. Evidence resulting from replicated studies plotted against evidence resulting from the original publications. For the original publications, evidence for the alternative hypothesis was calculated taking into account the possibility of publication bias. Small crosses indicate cases where neither the replication nor the original gave strong evidence. Circles indicate cases where one or the other gave strong evidence, with the size of each circle proportional to the ratio of the replication sample size to the original sample size (a reference circle appears in the lower right). The area labeled ‘replication uninformative’ contains cases where the original provided strong evidence but the replication did not, and the area labeled ‘original uninformative’ contains cases where the reverse was true. Two studies that fell beyond the limits of the figure in the top right area (i.e., that yielded extremely large Bayes factors both times) and two that fell above the top left area (i.e., large Bayes factors in the replication only) are not shown. The effect that relative sample size has on Bayes factor pairs is shown by the systematic size difference of circles going from the bottom right to the top left. All values in this figure can be found in S1 Table.

Abridged conclusion (read the paper for more! More what? Nuance, of course. Bayesians are known for their nuance…)

Even when taken at face value, the original studies frequently provided only weak evidence when analyzed using Bayes factors (i.e., BF < 10), and as you’d expect this already small amount of evidence shrinks even more when you take into account the possibility of publication bias. This has a few nasty implications. As we say in the paper,

In the likely event that [the original] observed effect sizes were inflated … the sample size recommendations from prospective power analysis will have been underestimates, and thus replication studies will tend to find mostly weak evidence as well.

According to our analysis, in which a whopping 57 out of 72 replications had 1/10 < BF < 10, this appears to have been the case.
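
The counts scattered through this post fit together, and a few lines of R make that easy to see. Only 72, 8, 46, and 57 are quoted from the text; the two mixed cells are inferred from them, under the assumption (supported by the results above) that no Bayes factor fell below 1/10.

```r
# Counts stated in the post; the two "mixed" cells are inferred, not quoted.
n_studies   <- 72
both_strong <- 8    # both BFs above 10
both_weak   <- 46   # both BFs between 1/10 and 10
weak_reps   <- 57   # replications with 1/10 < BF < 10

# Assumption: no BF fell below 1/10, so every study lands in one of four cells.
orig_only_strong <- weak_reps - both_weak
rep_only_strong  <- n_studies - both_strong - both_weak - orig_only_strong

cells <- c(both_strong = both_strong, both_weak = both_weak,
           orig_only_strong = orig_only_strong, rep_only_strong = rep_only_strong)
print(cells)
print(round(100 * cells / n_studies))   # percentages, for comparison with the abstract
```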

We also should be wary of claims about hidden moderators. We put it like this in the paper,

The apparent discrepancy between the original set of results and the outcome of the Reproducibility Project can be adequately explained by the combination of deleterious publication practices and weak standards of evidence, without recourse to hypothetical hidden moderators.

Of course, we are not saying that hidden moderators could not have had an influence on the results of the RPP. The statement is merely that we can explain the results reasonably well without necessarily bringing hidden moderators into the discussion. As Laplace would say: We have no need of that hypothesis.

So to sum up,

From a Bayesian reanalysis of the Reproducibility Project: Psychology, we conclude that one reason many published effects fail to replicate appears to be that the evidence for their existence was unacceptably weak in the first place.

With regard to interpretation of results — I will include the same disclaimer here that we provide in the paper:

It is important to keep in mind, however, that the Bayes factor as a measure of evidence must always be interpreted in the light of the substantive issue at hand: For extraordinary claims, we may reasonably require more evidence, while for certain situations—when data collection is very hard or the stakes are low—we may satisfy ourselves with smaller amounts of evidence. For our purposes, we will only consider Bayes factors of 10 or more as evidential—a value that would take an uninvested reader from equipoise to a 91% confidence level. Note that the Bayes factor represents the evidence from the sample; other readers can take these Bayes factors and combine them with their own personal prior odds to come to their own conclusions.
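
To see what "combine them with their own personal prior odds" looks like in practice, here is a minimal sketch. The helper name bf_to_posterior and the three example readers are invented for illustration.

```r
# Posterior probability of H_A from a Bayes factor and prior odds in favor of H_A.
# bf_to_posterior is a hypothetical helper name.
bf_to_posterior <- function(bf, prior_odds = 1) {
  post_odds <- bf * prior_odds
  post_odds / (1 + post_odds)
}

# Three imaginary readers with different prior odds for H_A, all seeing BF = 10
prior_odds <- c(skeptic = 1/10, equipoise = 1, believer = 10)
print(round(bf_to_posterior(10, prior_odds), 3))
```

The same Bayes factor moves every reader by the same multiplicative amount on the odds scale; where they end up depends on where they started.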

All of the results are tabulated in the supplementary materials (HERE) and the code is on GitHub (CODE HERE).


 

More disclaimers, code, and differences from the old reanalysis

Disclaimer:

All of the results are tabulated in the supplementary information (link), and MATLAB code to reproduce the results and figures is provided online (CODE HERE). When interpreting these results, we use a Bayes factor threshold of 10 to represent strong evidence. If you would like to see how the results change when using a different threshold, all you have to do is change the code in line 118 of the ‘bbc_main.m’ file to whatever thresholds you prefer.

#######

Important note: The function to calculate the mitigated Bayes factors is a prototype and is not robust to misuse. You should not use it unless you know what you are doing!

#######

A few differences between this paper and an old reanalysis:

A few months back I posted a Bayesian reanalysis of the Reproducibility Project: Psychology, in which I calculated replication Bayes factors for the RPP studies. This analysis took the posterior distribution from the original studies as the prior distribution in the replication studies to calculate the Bayes factor. So in that calculation, the hypotheses being compared are: H_0 “There is no effect” vs. H_A “The effect is close to that found by the original study.” It also did not take into account publication bias.

This is important: The published reanalysis is very different from the one in the first blog post.

Since the posterior distributions from the original studies were usually centered on quite large effects, the replication Bayes factors could fall in a wide range of values. If a replication found a moderately large effect, comparable to the original, then the Bayes factor would strongly favor H_A. If the replication found a small-to-zero effect (or an effect in the opposite direction), the Bayes factor would strongly favor H_0. If the replication found an effect in the middle of the two hypotheses, then the Bayes factor would be closer to 1, meaning the data fit both hypotheses about equally poorly. This last case happened when the replications found effects in the same direction as the original studies but of smaller magnitude.
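
Here is a rough sketch of the logic in the paragraph above, using normal approximations. It is not the calculation from the old post: the effect sizes, standard errors, and the helper name rep_bf are all invented for illustration, and I assume a flat prior before the original study so that its posterior is simply Normal(d_orig, se_orig^2).

```r
# Replication Bayes factor under normal approximations (illustrative numbers only).
# H_0: the effect is 0.
# H_A: the effect ~ Normal(d_orig, se_orig^2), i.e., the original study's posterior.
rep_bf <- function(d_rep, se_rep, d_orig, se_orig) {
  like_ha <- dnorm(d_rep, d_orig, sqrt(se_orig^2 + se_rep^2))  # prediction averaged over H_A
  like_h0 <- dnorm(d_rep, 0, se_rep)                           # prediction under H_0
  like_ha / like_h0
}

# Three hypothetical replication results for an original effect of 0.6
d_rep <- c(similar = 0.6, in_between = 0.3, near_zero = 0.0)
print(rep_bf(d_rep, se_rep = 0.15, d_orig = 0.6, se_orig = 0.15))
```

With these made-up numbers the three cases land where the paragraph says they should: a large BF for H_A, a BF not far from 1, and a large BF for H_0.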

These three types of outcomes happened with roughly equal frequency; there were lots of strong replications (big BF favoring H_A), lots of strong failures to replicate (BF favoring H_0), and lots of ambiguous results (BF around 1).

The results in this new reanalysis are not as extreme because the prior distribution for H_A is centered on zero, which means its predictions are more similar to those of H_0 than the old priors’ predictions were. Whereas roughly 20% of the studies in the first reanalysis were strongly in favor of H_0 (BF>10), that did not happen a single time in the new reanalysis. This new analysis also includes the possibility of a biased publication process, which can have a large effect on the results.

We use a different prior so we get different results. Hence the Jeffreys quote at the top of the page.

 

 

Slides: “Bayesian statistical concepts: A gentle introduction”

I recently gave a talk in Bielefeld, Germany with the title “Bayesian statistical concepts: A gentle introduction.” I had a few people ask for the slides so I figured I would post them here. If you are a regular reader of this blog, it should all look pretty familiar. It was a mash-up of a couple of my Understanding Bayes posts, combining “A look at the Likelihood” and the most recent one, “Evidence vs. Conclusions.” The main goal was to give the audience an appreciation for the comparative nature of Bayesian statistical evidence, and to demonstrate how evidence in the sample has to be interpreted in the context of the specific problem. I didn’t go into Bayes factors or posterior estimation because I promised that it would be a simple and easy talk about the basic concepts.

I’m very grateful to JP de Ruiter for inviting me out to Bielefeld to give this talk, in part because it was my first talk ever! I think it went well enough, but there are a lot of things I can improve on, both in terms of slide content and verbal presentation. JP is very generous with his compliments, and he also gave me a lot of good pointers to incorporate for the next time I talk Bayes.

The main narrative of my talk was that we were to draw candies from one of two possible bags and try to figure out which bag we were drawing from. After each of the slides where I proposed the game I had a member of the audience actually come up and play it with me. The candies, bags, and cards were real but the bets were hypothetical. It was a lot of fun. 🙂

Here is a picture JP took during the talk.

Bielefeld Bayes intro

Here are the slides. (You can download a pdf copy from here.)

The One-Sided P-Value Paradox

Today on Twitter there was some chatting about one-sided p-values. Daniel Lakens thinks that by 2018 we’ll see a renaissance of one-sided p-values due to the advent of preregistration. There was a great conversation that followed Daniel’s tweet, so go click the link above and read it and we’ll pick this back up once you do.

Okay.

As you have seen, and as is typical of discussions around p-values in general, the question of evidence arises. How do one-sided p-values relate to two-sided p-values as measures of statistical evidence? In this post I will argue that thinking through the logic of one-sided p-values highlights a true illogic of significance testing. This example is largely adapted from Royall’s 1997 book.

The setup

The idea behind Fisher’s significance tests goes something like this. We have a hypothesis that we wish to find evidence against. If the evidence is strong enough then we can reject this hypothesis. I will use the binomial example because it lends itself to good storytelling, but this works for any test.

Premise A: Say I wish to determine if my coin is unfair. That is, I want to reject the hypothesis, H1, that the probability of heads is equal to ½. This is a standard two-sided test. If I flip my coin a few times and observe x heads, I can reject H1 (at level α) if the probability of obtaining x or more heads is less than α/2. If my α is set to the standard level, .05, then I can reject H1 if Pr(x or more heads) ≤ .025. In this framework, I have strong evidence that the probability of heads is not equal to ½ if my p-value is lower than .025. That is, I can claim (at level α) that the probability of heads is either greater than ½ or less than ½ (proposition A).

Premise B: If I have some reason to think the coin might be biased in a particular direction, say there is a kid on the block with a coin biased to come up heads more often than not, then I might want to use a one-sided test. In this test, the hypothesis to be rejected, H2, is that the probability of heads is less than or equal to ½. In this case I can reject H2 (at level α) if the probability of obtaining x or more heads is less than α. If my α is set to the standard level again, .05, then I can reject H2 if Pr(x or more heads) < .05. Now I have strong evidence that the probability of heads is not equal to ½, nor is it less than ½, if my p-value is less than .05. That is, I can claim (again at level α) that the probability of heads is greater than ½ (proposition B).

As you can see, proposition B is a stronger logical claim than proposition A. Saying that my car is faster than your car is making a stronger claim than saying that my car is either faster or slower than your car.

The paradox

If I obtain a result x, such that α/2 < Pr(x or more heads) < α, (e.g., .025 < p < .05), then I have strong evidence for the conclusion that the probability of heads is greater than ½ (see proposition B). But at the same time I do not have strong evidence for the conclusion that the probability of heads is > ½ or < ½ (see proposition A).

I have defied the rules of logic. I have concluded the stronger proposition, probability of heads > ½, but I cannot conclude the weaker proposition, probability of heads > ½ or < ½. As Royall (1997, p. 77) would say, if the evidence justifies the conclusion that the probability of heads is greater than ½ then surely it justifies the weaker conclusion that the probability of heads is either > ½ or < ½.
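
To see that the paradox is not just a thought experiment, here is a small R sketch that looks for results falling in the gap between the two tests. The choice of 100 flips is arbitrary and only for illustration.

```r
n     <- 100    # hypothetical number of flips
alpha <- 0.05

x <- 51:n
# Pr(x or more heads) when the probability of heads is 1/2
upper_tail <- pbinom(x - 1, n, 0.5, lower.tail = FALSE)

reject_one_sided <- upper_tail < alpha       # premise B: conclude Pr(heads) > 1/2
reject_two_sided <- upper_tail < alpha / 2   # premise A: conclude Pr(heads) is not 1/2

# Results where the stronger claim is licensed but the weaker one is not
paradox_x <- x[reject_one_sided & !reject_two_sided]
print(data.frame(heads = paradox_x,
                 upper_tail = round(upper_tail[x %in% paradox_x], 4)))
```

Any number of heads listed in the output lets me claim the coin favors heads while forbidding me from claiming the coin is unfair.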

Should we use one-sided p-values?

Go ahead, I can’t stop you. But be aware that if you try to interpret p-values, either one- or two-sided, as measures of statistical (logical) evidence then you may find yourself in a p-value paradox.


References and further reading:

Royall, R. (1997). Statistical evidence: A likelihood paradigm (Vol. 71). CRC press. Chapter 3.7.