Slides: “Bayesian Bias Correction: Critically evaluating sets of studies in the presence of publication bias”

I recently gave a lab presentation on the work we have been doing to attempt to mitigate the nefarious effects of publication bias, and I thought I’d share the slides here. The first iteration of the method (details given in Guan and Vandekerckhove, 2016), summarized in the first half of the slides, could be applied to single studies or to cases where a fixed effects meta-analysis would be appropriate. I have been working to extend the method to cases where one would perform a random-effects meta-analysis to account for heterogeneity in effects across studies, summarized in the second half of the slides. We’re working now to write this extension up and tidy up the code for dissemination.

Here are the slides (non-slideshare pdf available here):

 

A quick comment on recent BF (vs p-value) error control blog posts

There have recently been two stimulating posts regarding error control for Bayes factors. (Stimulating enough to get me to write this, at least.) Daniel Lakens commented on how Bayes factors can vary across studies due to sampling error. Tim van der Zee compared the type 1 and type 2 error rates for using p-values versus using BFs. My comment is not so much to pass judgment on the content of the posts (other than this quick note that they are not really proper Bayesian simulations), but to suggest an easier way to do what they are already doing. They both use simulations to get their error rates (which can take ages when you have lots of groups), but in this post I’d like to show a way to find the exact same answers without simulation, by just thinking about the problem from a slightly different angle.

Lakens and van der Zee both set up their simulations as follows: For a two sample t-test, assume a true underlying population effect size (i.e., δ), a fixed sample size per group (n1 and n2),  and calculate a Bayes factor comparing a point null versus an alternative hypothesis that assigns δ a prior distribution of Cauchy(0, .707) [the default prior for the Bayesian t-test]. Then simulate a bunch of sample t-values from the underlying effect size, plug them into the BayesFactor R package, and see what proportion of BFs are above, below or between certain values (both happen to focus on 3 and 1/3). [This is a very common simulation setup that I see in many blogs these days.]

I’ll just use a couple of representative examples from van der Zee’s post to show how to do this. Let’s say n1 = n2 = 50 and we use the default Cauchy prior on the alternative. In this setup, one can very easily calculate the resulting BF for any observed t-value using the BayesFactor R package. A BF of 3 corresponds to an observed | t | = ~2.47; a BF of 1/3 corresponds to | t | = ~1. These are your critical t values. Any t value greater than 2.47 (or less than -2.47) will have a BF > 3. Any t value between -1 and 1 will have BF < 1/3. Any t value between 1 and 2.47 (or between -1 and -2.47) will have 1/3 < BF < 3. All we have to do now is find out what proportion of sample t values would fall in these regions for the chosen underlying effect size, which is done by finding the area of the sampling distribution between the various critical values.
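If you'd like to verify those critical t values yourself, here is a quick sketch (assuming, if I remember the BayesFactor interface correctly, that ttest.tstat() returns the log of the Bayes factor in its bf element):

library(BayesFactor)

# BF10 for an observed t with n1 = n2 = 50 and the default Cauchy(0, .707) prior
bf10 <- function(t, n1 = 50, n2 = 50) {
  exp(ttest.tstat(t = t, n1 = n1, n2 = n2, rscale = sqrt(2)/2)$bf)
}

bf10(2.47)  # ~3
bf10(1)     # ~1/3

# Or search for the critical t values directly
uniroot(function(t) bf10(t) - 3, interval = c(0, 5))$root    # ~2.47
uniroot(function(t) bf10(t) - 1/3, interval = c(0, 5))$root  # ~1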

easier type 1 errors

If the underlying effect size for the simulation is δ = 0 (i.e., the null hypothesis is true), then observed t-values will follow the typical central t-distribution. For 98 degrees of freedom, this looks like the following.

[Figure: the central t-distribution with 98 degrees of freedom]

I have marked the critical t values for BF = 3 and BF = 1/3 found above. van der Zee denotes BF > 3 as type 1 errors when δ = 0. The type 1 error rate is found by calculating the area under this curve in the tails beyond | t | = 2.47. A simple line in R gives the answer:

2*pt(-2.47,df=98)
[1] 0.0152

The type 1 error rate is thus 1.52% (van der Zee’s simulations found 1.49%, see his third table). van der Zee notes that this is much lower than the type 1 error rate of 5% for the frequentist t test (the area in the tails beyond | t | = 1.98) because the t criterion is much higher for a Bayes factor of 3 than for a p value of .05. [As an aside, if one wanted the BF criterion corresponding to a type 1 error rate of 5%, it is BF > 1.18 in this case (i.e., this is the BF obtained from | t | = 1.98). That is, for this setup, a 5% type 1 error rate is achieved nearly automatically.]

The rate at which t values fall between -2.47 and -1 and between 1 and 2.47 (i.e., find 1/3 < BF < 3) is the area of this curve between -2.47 and -1 plus the area between 1 and 2.47, found by:

2*(pt(-1,df=98)-pt(-2.47,df=98))
[1] 0.3045337

The rate at which t values fall between -1 and 1 (i.e., find BF < 1/3) is the area between -1 and 1, found by:

pt(1,df=98)-pt(-1,df=98)
[1] 0.6802267

easier type 2 errors

If the underlying effect size for the simulation is changed to δ = .4 (another one of van der Zee’s examples, and now similar to Lakens’s example), the null hypothesis is then false and the relevant t distribution is no longer centered on zero (and is asymmetric). To find the new sampling distribution, called the noncentral t-distribution, we need to find the noncentrality parameter for the t-distribution that corresponds to δ = .4 when n1 = n2 = 50. For a two-sample t test, this is found by a simple formula, ncp = δ / √(1/n1 + 1/n2); in this case we have ncp = .4 / √(1/50 + 1/50) = .4 / .2 = 2. The noncentral t-distribution for δ = .4 and 98 degrees of freedom looks like the following.

[Figure: the noncentral t-distribution for δ = .4 with 98 degrees of freedom]

I have again marked the relevant critical values. van der Zee denotes BF < 1/3 as type 2 errors when δ ≠ 0 (and Lakens is also interested in this area). The rate at which this occurs is once again the area under the curve between -1 and 1, found by:

pt(1,df=98,ncp=2)-pt(-1,df=98,ncp=2)
[1] 0.1572583

The type 2 error rate is thus 15.7% (van der Zee’s simulation finds 16.8%, see his first table). The other rates of interest are similarly found.
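For completeness, here is a sketch of how I would find those other rates, reusing the critical t values from above:

# Noncentrality parameter for delta = .4 with n1 = n2 = 50
delta <- .4
ncp <- delta / sqrt(1/50 + 1/50)  # = 2

# Rate of BF > 3 when delta = .4 (area beyond |t| = 2.47)
pt(-2.47, df = 98, ncp = ncp) + (1 - pt(2.47, df = 98, ncp = ncp))

# Rate of 1/3 < BF < 3 when delta = .4 (area between 1 and 2.47 in absolute value)
(pt(2.47, df = 98, ncp = ncp) - pt(1, df = 98, ncp = ncp)) +
  (pt(-1, df = 98, ncp = ncp) - pt(-2.47, df = 98, ncp = ncp))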

Conclusion

You don’t necessarily need to simulate this stuff! You can save a lot of simulation time by working it out with a little arithmetic plus a few easy lines of code.

 

 

Understanding Bayes: How to cheat to get the maximum Bayes factor for a given p value

OR less click-baity: What is the maximum Bayes factor you can get for a given p value? (Obvious disclaimer: Don’t cheat)

Starting to use and interpret Bayesian statistics can be hard at first. A recent recommendation that I like is from Zoltan Dienes and Neil McLatchie, to “Report a B for every p.” Meaning: for every p value in the paper, report a corresponding Bayes factor. This way the psychology community can start to build an intuition about how these two kinds of results can correspond. I think this is a great way to start using Bayes. And if, as time goes on, you want to flush those ps down the toilet, I won’t complain.

Researchers who start to report both Bayesian and frequentist results often go through a phase where they are surprised to find that their p<.05 results correspond to weak Bayes factors. In this Understanding Bayes post I hope to pump your intuitions a bit as to why this is the case. There is, in fact, an absolute maximum Bayes factor for a given p value. There are also other soft maximums it can achieve for different classes of prior distributions. And these maximum BFs may not be as high as you expect.

Absolute Maximum

The reason for the absolute maximum is actually straightforward. The Bayes factor compares how accurately two or more competing hypotheses predict the observed data. Usually one of those hypotheses is a point null hypothesis, which says there is no effect in the population (however defined). The alternative can be anything you like: a point hypothesis motivated by theory or taken from previous literature (uncommon), a (half-)normal (or other) distribution centered on the null (more common), or anything else. In any case, the fact is that to achieve the absolute maximum Bayes factor for a given p value you have to cheat. In real life you can never reach the absolute maximum in the normal course of analysis, so its only use is as a benchmark illustration.

You have to make your alternative hypothesis the exact point hypothesis that maximizes the likelihood of the data. The likelihood function ranks all the parameter values by how well they predict the data, so if you make your point hypothesis equal to the mode of the likelihood function, it means that no other hypothesis or population parameter could make the data more likely. This illicit prior is known as the oracle prior, because it is the prior you would choose if you could see the result ahead of time. So in the figure below, the oracle prior would correspond to the high dot on the curve at the mode, and the null hypothesis is the lower dot on the curve. The Bayes factor is then just the ratio of these heights.

When you are doing a t-test, for example, the maximum of the likelihood function is simply the sample mean. So in this case, the oracle prior is a point hypothesis at exactly the sample mean. Let’s assume that we know the population SD=10, so we’re only interested in the population mean. We collect 100 participants and the sample mean we get is 1.96. Our z score in this case is

z = mean / standard error = 1.96 / (10/√100) = 1.96.

This means we obtain a p value of exactly .05. Publication and glory await us. But, in sticking with our B for every p mantra, we decide to calculate an oracle Bayes factor just to be complete. This can easily be done in R using the following 1 line of code:

dnorm(1.96, 1.96, 1)/dnorm(1.96, 0, 1)

And the answer you get is BF = 6.83. This is the absolute maximum Bayes factor you can possibly get for a p value that equals .05 in a t test (you get similar BFs for other types of tests). That is the amount of evidence that would bring a neutral reader who has prior probabilities of 50% for the null and 50% for the alternative to posterior probabilities of 12.8% for the null and 87.2% for the alternative. You might call that moderate evidence depending on the situation. For p of .01, this maximum increases to ~27.5, which is quite strong in most cases. But these values are for the best case ever, where you straight up cheat. When you can’t blatantly cheat the results are not so good.
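If you want to do that evidence-to-posterior-probability conversion yourself, it is one more line of R (a sketch that assumes 50/50 prior probabilities, as in the example above):

# Posterior probability of the alternative from a Bayes factor, with 50/50 prior odds
bf_to_posterior <- function(bf) bf / (bf + 1)
bf_to_posterior(6.83)  # ~.872 for the alternative, leaving ~.128 for the null
bf_to_posterior(27.5)  # ~.965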

Soft Maximum

Of course, nobody in their right mind would accept your analysis if you used an oracle prior. It is blatant cheating — but it gives a good benchmark. For p of .05 and the oracle prior, the best BF you can ever get is slightly less than 7. If you can’t blatantly cheat by using an oracle prior, the maximum Bayes factor you can get obviously won’t be as high. But it may surprise you how much smaller the maximum becomes if you decide to cheat more subtly.

The priors most people use for the alternative hypothesis in the Bayes factor are not point hypotheses, but distributed hypotheses. A common recommendation is a unimodal (i.e., one-hump) symmetric prior centered on the null hypothesis value. (There are times where you wouldn’t want to use a prior centered on the null value, but in those cases the maximum BF goes back to being the BF you get using an oracle prior.) I usually recommend using normal distribution priors, and JASP software uses a Cauchy distribution which is similar but with fatter tails. Most of the time the BFs you get are very similar.

So imagine that instead of using the blatantly cheating oracle prior, you use a subtle oracle prior. Instead of a point alternative at the observed mean, you use a normal distribution and pick the scale (i.e., the SD) of your prior to maximize the Bayes factor. There is a formula for this, but the derivation is very technical so I’ll let you read Berger and Sellke (1987, especially section 3) if you’re into that sort of torture.

It turns out, once you do the math, that when using a normal distribution prior the maximum Bayes factor you can get for a p value of .05 is BF = 2.1. That is the amount of evidence that would bring a neutral reader who has prior probabilities of 50% for the null and 50% for the alternative to posterior probabilities of 32% for the null and 68% for the alternative. Barely different! That is very weak evidence. The maximum normal-prior BF corresponding to a p value of .01 is BF = 6.5. That is still hardly convincing evidence! You can find this bound for any t value you like (for any t greater than 1) using the R code below:

t = 1.96
maxBF = 1/(sqrt(exp(1))*t*exp(-t^2/2))
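If you would rather not take the formula on faith, here is a quick numerical sanity check for the running example (observed mean 1.96, standard error 1, population SD known): with a N(0, τ) prior on the mean, the observed mean is marginally N(0, √(1 + τ²)) under the alternative, so we can just let optimize() pick the τ that maximizes the Bayes factor.

# BF10 under a normal prior with SD = tau, for the example above (z = 1.96, SE = 1)
bf_normal_prior <- function(tau, z = 1.96) {
  dnorm(z, 0, sqrt(1 + tau^2)) / dnorm(z, 0, 1)
}

# Pick the prior SD that maximizes the BF (the subtle cheat)
optimize(bf_normal_prior, interval = c(0, 100), maximum = TRUE)
# $maximum is the cheating prior SD (~1.69); $objective is the max BF (~2.1)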

(You can get slightly different maximum values for different formulations of the problem. Another form, due to Sellke, Bayarri, & Berger [2001], is 1/[-e*p*ln(p)] for p < ~.4, which for p = .05 returns BF = 2.45.)
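In code, that alternative bound is just as short:

# Sellke, Bayarri, & Berger (2001) bound on the maximum BF, valid for p < ~.4
p <- .05
1/(-exp(1)*p*log(p))  # ~2.45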

You might say, “Wait no I have a directional prediction, so I will use a half-normal prior that allows only positive values for the population mean. What is my maximum BF now?” Luckily the answer is simple: Just multiply the old maximum by:

2*(1 - p/2)

So for p of .05 and .01 the maximum 1-sided BFs are 4.1 and 13, respectively. (By the way, this trick works for converting most common BFs from 2- to 1-sided.)
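In code, using the two-sided maxima from above:

# One-sided (half-normal) maximum BF = two-sided maximum * 2*(1 - p/2)
p <- .05
2.1 * 2*(1 - p/2)  # ~4.1
p <- .01
6.5 * 2*(1 - p/2)  # ~13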

Take home message

Do not be surprised if you start reporting Bayes factors and find that what you thought was strong evidence based on a p value of .05 or even .01 translates to a quite weak Bayes factor.

And I think this goes without saying, but don’t try to game your Bayes factors. We’ll know. It’s obvious. The best thing to do is use the prior distribution you find most reasonable for the problem at hand and then do a robustness check by seeing how much the conclusion you draw depends on the specific prior you choose. JASP software can do this for you automatically in many cases (e.g., for the Bayesian t-test; P.S. check out our official JASP tutorial videos!).

R code

The following is the R code to reproduce the figure, to find the max BF for oracle priors, and to find the max BF for subtle oracle priors. Tinker with it and see how your intuitions match the answers you get!
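Here is a minimal sketch of those calculations, with a rough stand-in for the figure:

# Rough version of the figure: the likelihood for an observed mean of 1.96 (SE = 1),
# with the oracle prior (a point at the mode) and the null (a point at zero) marked
curve(dnorm(1.96, x, 1), from = -2, to = 5, xlab = "Population mean", ylab = "Likelihood")
points(c(1.96, 0), dnorm(1.96, c(1.96, 0), 1), pch = 19)

# Max BF for an oracle prior (point alternative at the observed mean), given z
oracle_max_bf <- function(z) dnorm(z, z, 1) / dnorm(z, 0, 1)
oracle_max_bf(1.96)   # ~6.83 (p = .05)
oracle_max_bf(2.576)  # ~27.5 (p = .01)

# Max BF for a subtle oracle prior (normal centered on the null, best-case SD), for |t| > 1
subtle_max_bf <- function(t) 1/(sqrt(exp(1))*t*exp(-t^2/2))
subtle_max_bf(1.96)   # ~2.1
subtle_max_bf(2.576)  # ~6.5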

 

 

Video: “A Bayesian Perspective of the Reproducibility Project: Psychology”

I recently gave a talk at the University of Bristol’s Medical Research Council Integrative Epidemiology Unit, titled, “A Bayesian Perspective on the Reproducibility Project: Psychology,” in which I recount the results from our recently published Bayesian reanalysis of the RPP (you can read it in PLOS ONE). In that paper Joachim Vandekerckhove and I reassessed the evidence from the RPP and found that most of the original and replication studies only managed to obtain weak evidence.

I’m very grateful to Marcus Munafo for inviting me out to give this talk. And I’m also grateful to Jim Lumsden for help organizing. We recorded the talk’s audio and synced it to a screencast of my slides, so if you weren’t there you can still hear about it. 🙂

I’ve posted the slides on slideshare, and you can download a copy of the presentation by clicking here. (It says 83 slides, but the last ~30 slides are a technical appendix prepared for the Q&A)

If you think this is interesting and you’d like to learn more about Bayes, you can check out my Understanding Bayes tutorial series and also our paper, “How to become a Bayesian in eight easy steps.”

Sunday Bayes: Optional stopping is no problem for Bayesians

Optional stopping does not affect the interpretation of posterior odds. Even with optional stopping, a researcher can interpret the posterior odds as updated beliefs about hypotheses in light of data.

–Rouder, 2014 (pdf link)

Sunday Bayes

The format of this series is short and simple: Every week I will give a quick summary of a paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere.

Optional stopping: No problem for Bayesians

Bayesian analysts use probability to express a degree of belief. For a flipped coin, a probability of 3/4 means that the analyst believes it is three times more likely that the coin will land heads than tails. Such a conceptualization is very convenient in science, where researchers hold beliefs about the plausibility of theories, hypotheses, and models that may be updated as new data become available. (p. 302)

It is becoming increasingly common to evaluate statistical procedures by way of simulation. Instead of doing formal analyses, we can use flexible simulations to tune many different parameters and immediately see the effect they have on the behavior of a procedure.

Simulation results have a tangible, experimental feel; moreover, if something is true mathematically, we should be able to see it in simulation as well. (p. 303)

But this brings with it a danger that the simulations performed might be doing the wrong thing, and unless we have a good grasp of the theoretical background of what is being simulated we can easily be misled. In this paper, Rouder (pdf) shows that common intuitions we have for evaluating simulations of frequentist statistics often do not translate to simulations of Bayesian statistics.

The critical element addressed here is whether optional stopping is problematic for Bayesians. My  argument is that both sets of authors use the wrong criteria or lens to draw their conclusions. They evaluate and interpret Bayesian statistics as if they were frequentist statistics. The more germane question is whether Bayesian statistics are interpretable as Bayesian statistics even if data are collected under optional stopping. (p. 302)

When we evaluate a frequentist procedure via simulation, it is common to set a parameter to a certain value and evaluate the number of times certain outcomes occur. For example, we can set the difference between two group means to zero, simulate a bunch of p values, and see how many fall below .05. Then we can set the difference to some nonzero number, simulate a bunch of p values, and again see how many are below .05. The first gives you the type-1 error rate for the procedure, and the second gives you the statistical power. This is appropriate for frequentist procedures because the probabilities calculated are always conditional on one or the other hypothesis being true.
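As a concrete sketch of that kind of frequentist simulation (the effect size of .4 and the 50 per group are arbitrary choices for illustration):

set.seed(1)
n_sims <- 10000
# p values when the true difference is zero, and when it is .4 SDs
p_null <- replicate(n_sims, t.test(rnorm(50, 0), rnorm(50, 0))$p.value)
p_alt  <- replicate(n_sims, t.test(rnorm(50, .4), rnorm(50, 0))$p.value)
mean(p_null < .05)  # approximate type 1 error rate (~.05)
mean(p_alt < .05)   # approximate power for this effect size and sample size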

One might be tempted to evaluate Bayes factors in the same way; that is, set the difference between two groups to zero and see how many BFs are above some threshold, and then set the difference to something nonzero and see how many BFs are again above some threshold.

The critical error … is studying Bayesian updating conditional on some hypothetical truth rather than conditional on data. This error is easy to make because it is what we have been taught and grown familiar with in our frequentist training. (p. 308)

Evaluating simulations of Bayes factors in this way is incorrect. Bayes factors (and posterior odds) are conditional on only the data observed. In other words, the appropriate evaluation is: “Given that I have observed this data (i.e., BF = x), what is the probability the BF was generated by H1 vs H0?”

Rouder visualizes this as follows. Flip a coin to choose the true hypothesis, then simulate a Bayes factor, and repeat these two steps many, many times. At the end of the simulation, whenever BF = x is observed, check and see how many of those came from one model vs. the other. The simulation shows that in this scenario, if we look at all the times a BF of 3 in favor of a model is observed, three of those BFs will have been generated when that model was true for every one generated when it was false. Since the prior odds are 1 to 1, the posterior odds equal the Bayes factor.

[Figure: Rouder's Figure 2 (fixed-n sampling): distributions of observed Bayes factors when the null is true (purple, projected downward) and when the alternative is true (pink, projected upward)]

In the figure above (taken from Rouder’s Figure 2) you can see the distribution of Bayes factors observed when the null is true (purple, projected downwards) vs. when the alternative is true (pink, projected upwards). Remember, the true hypothesis was chosen by coin flip. You can clearly see that when a BF of 3 to 1 in favor of the null is observed, the purple column is three times bigger than the pink column (shown with the arrows).
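If you want to convince yourself of this with a quick simulation of your own, here is a sketch. To keep things simple I use a normal model with known variance and a N(0, 1) prior on the effect rather than Rouder's t-test setup, so the Bayes factor has a closed form; the logic is the same.

set.seed(123)
n       <- 30                             # observations per simulated study
n_sims  <- 5e5
h1_true <- rbinom(n_sims, 1, .5)          # coin flip chooses the true hypothesis
mu      <- rnorm(n_sims, 0, 1) * h1_true  # under H1, the effect comes from its prior
xbar    <- rnorm(n_sims, mu, 1/sqrt(n))   # observed sample mean

# Closed-form BF10: marginally, xbar ~ N(0, sqrt(1 + 1/n)) under H1 and N(0, 1/sqrt(n)) under H0
bf10 <- dnorm(xbar, 0, sqrt(1 + 1/n)) / dnorm(xbar, 0, 1/sqrt(n))

# Among all the times a BF near 3 is observed, what fraction came from H1?
near3 <- abs(log(bf10) - log(3)) < .1
mean(h1_true[near3])  # ~.75, i.e., 3 from the favored model for every 1 from the other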

Below (also taken from Rouder’s Figure 2) you see what happens when one employs optional stopping (e.g., flip a coin to pick the underlying true model, then sample until the BF favors one model over the other by at least 10 or you reach a maximum n). The distribution of Bayes factors generated by each model becomes highly skewed, which is often taken as evidence that conclusions drawn from Bayes factors depend on the stopping rule. The incorrect interpretation would be: Given the null is true, the number of times I find BF = x in favor of the alternative (i.e., in favor of the wrong model) has gone up, therefore the BF is sensitive to optional stopping. This is incorrect because it conditions on one model being true and checks the number of times a BF is observed, rather than conditioning on the observed BF and checking how often it came from H0 vs. H1.

Look again at what matters: What is the ratio of observed BFs that come from H1 vs. H0 for a given BF? No matter what stopping rule is used, the answer is always the same: If the true hypothesis is chosen by a coin flip, and a BF of 10 in favor of the alternative is observed, there will be 10 times as many observed BFs in the alternative column (pink) than in the null column (purple).

[Figure: Rouder's Figure 2 (optional stopping): distributions of observed Bayes factors under the stop-when-the-BF-reaches-10 rule, again with the null in purple and the alternative in pink]

In Rouder’s simulations he always used prior odds of 1 to 1, because then the posterior odds equal the Bayes factor. If one were to change the prior odds then the Bayes factor would no longer equal the posterior odds, and the shape of the distribution would again change; importantly, though, while the absolute number of Bayes factors that end up in each bin would change, the ratio of each pink column to its purple column would still correspond to the posterior odds. No matter what stopping rule you use, the conclusions we draw from Bayes factors and posterior odds are unaffected by the stopping rule.

Feel free to employ any stopping rule you wish.
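To see this with optional stopping, here is a sketch using the same toy model (normal data, known variance, N(0, 1) prior on the effect), where each simulated study keeps sampling until the BF passes 10 in either direction or a maximum n is reached:

set.seed(456)
n_sims   <- 20000
n_max    <- 100
bf_bound <- 10

sim_one <- function() {
  h1_true <- rbinom(1, 1, .5)              # coin flip chooses the true hypothesis
  mu <- if (h1_true == 1) rnorm(1) else 0  # under H1, the effect comes from its prior
  x  <- rnorm(2, mu)                       # start with two observations
  repeat {
    n    <- length(x)
    bf10 <- dnorm(mean(x), 0, sqrt(1 + 1/n)) / dnorm(mean(x), 0, 1/sqrt(n))
    if (bf10 > bf_bound || bf10 < 1/bf_bound || n >= n_max) break
    x <- c(x, rnorm(1, mu))                # optional stopping: keep sampling
  }
  c(h1_true = h1_true, bf10 = bf10)
}

res <- replicate(n_sims, sim_one())

# Among optional-stopping results where the BF lands just past 10, what fraction came from H1?
near10 <- res["bf10", ] > 10 & res["bf10", ] < 12
mean(res["h1_true", near10])  # ~.9, i.e., roughly 10 from H1 for every 1 from H0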

This result was recently shown again by Deng, Lu, and Chen in a paper posted to arXiv (pdf link) using similar simulations, and they go further in that they prove the theorem.

A few choice quotes

Page 308:

Optional-stopping protocols may be hybrids where sampling occurs until the Bayes factor reaches a certain level or a certain number of samples is reached. Such an approach strikes me as justifiable and reasonable, perhaps with the caveat that such protocols be made explicit before data collection. The benefit of this approach is that more resources may be devoted to more ambiguous experiments than to clear ones.

Page 308:

The critical error … is studying Bayesian updating conditional on some hypothetical truth rather than conditional on data. This error is easy to make because it is what we have been taught and grown familiar with in our frequentist training. In my opinion, the key to understanding Bayesian analysis is to focus on the degree of belief for considered models, which need not and should not be calibrated relative to some hypothetical truth.

Page 306-307:

When we update relative beliefs about two models, we make an implicit assumption that they are worthy of our consideration. Under this assumption, the beliefs may be updated regardless of the stopping rule. In this case, the models are dramatically wrong, so much so that the posterior odds contain no useful information whatsoever. Perhaps the more important insight is not that optional stopping is undesirable, but that the meaningfulness of posterior odds is a function of the usefulness of the models being compared.

Sunday Bayes: Testing precise hypotheses

First and foremost, when testing precise hypotheses, formal use of P-values should be abandoned. Almost anything will give a better indication of the evidence provided by the data against Ho.

–Berger & Delampady, 1987 (pdf link)

Sunday Bayes series intro:

After the great response to the eight easy steps paper we posted, I started a recurring series, where each week I highlight one of the papers that we included in the appendix of the paper. The format is short and simple: I will give a quick summary of the paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere. At the end of the post I will list a few suggestions for the next entry, so vote in the comments or on twitter (@alxetz) for which one you’d like next. This paper was voted to be the next in the series.

(I changed the series name to Sunday Bayes, since I’ll be posting these on every Sunday.)

Testing precise hypotheses

This would indicate that, say, claiming that a P-value of .05 is significant evidence against a precise hypothesis is sheer folly; the actual Bayes factor may well be near 1, and the posterior probability of Ho near 1/2 (p. 326)

Berger and Delampady (pdf link) review the background and standard practice for testing point null hypotheses (i.e., “precise hypotheses”). The paper came out nearly 30 years ago, so some parts of the discussion may not be as relevant these days, but it’s still a good paper.

They start by reviewing the basic measures of evidence — p-values, Bayes factors, posterior probabilities — before turning to an example. Rereading it, I remember why we gave this paper one of the highest difficulty ratings in the eight steps paper. There is a lot of technical discussion in this paper, but luckily I think most of the technical bits can be skipped in favor of reading their commentary.

One of the main points of this paper is to investigate precisely when it is appropriate to approximate a small interval null hypothesis by using a point null hypothesis. They conclude that, most of the time, the error of approximation for Bayes factors will be small (<10%),

these numbers suggest that the point null approximation to Ho will be reasonable so long as [the width of the null interval] is one-half a [standard error] in width or smaller. (p. 322)

A secondary point of this paper is to refute the claim that classical answers will typically agree with some “objective” Bayesian analyses. Their conclusion is that such a claim

is simply not the case in the testing of precise hypotheses. This is indicated in Table 1 where, for instance, P(Ho | x) [NB: the posterior probability of the null] is from 5 to 50 times larger than the P-value. (p. 318)

They also review some lower bounds on the amount of Bayesian evidence that corresponds to significant p-values. They sum up their results thusly,

The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)

and

the weighted likelihood of H1 is at most [2.5] times that of Ho. A likelihood ratio [NB: Bayes factor] of [2.5] is not particularly strong evidence, particularly when it is [an upper] bound. However, it is customary in practice to view [p] = .05 as strong evidence against Ho. A P-value of [p] = .01, often considered very strong evidence against Ho, corresponds to [BF] = .1227, indicating that H1 is at most 8 times as likely as Ho. The message is simple: common interpretation of P-values, in terms of evidence against precise [null] hypotheses, are faulty (p. 323)

A few choice quotes

Page 319:

[A common opinion is that if] θ0 [NB: a point null] is not in [a confidence interval] it can be rejected, and looking at the set will provide a good indication as to the actual magnitude of the difference between θ and θ0. This opinion is wrong, because it ignores the supposed special nature of θ0. A point can be outside a 95% confidence set, yet not be so strongly contraindicated by the data. Only by calculating a Bayes factor … can one judge how well the data supports a distinguished point θ0.

Page 327:

Of course, every statistician must judge for himself or herself how often precise hypotheses actually occur in practice. At the very least, however, we would argue that all types of tests should be able to be properly analyzed by statistics

Page 327 (emphasis original, since that text is a subheading):

[It is commonly argued that] The P-Value Is Just a Data Summary, Which We Can Learn To Properly Calibrate … One can argue that, through experience, one can learn how to interpret P-values. … But if the interpretation depends on Ho, the sample size, the density and the stopping rule, all in crucial ways, it becomes ridiculous to argue that we can intuitively learn to properly calibrate P-values.

Page 328:

we would urge reporting both the Bayes factor, B, against [H0] and a confidence or credible region, C. The Bayes factor communicates the evidence in the data against [H0], and C indicates the magnitude of the possible discrepancy.

Page 328:

Without explicit alternatives, however, no Bayes factor or posterior probability could be calculated. Thus, the argument goes, one has no recourse but to use the P-value. A number of Bayesian responses to this argument have been raised … here we concentrate on responding in terms of the discussion in this paper. If, indeed, it is the case that P-values for precise hypotheses essentially always drastically overstate the actual evidence against Ho when the alternatives are known, how can one argue that no problem exists when the alternatives are not known?


Vote for the next entry:

  1. Edwards, Lindman, and Savage (1963) — Bayesian Statistical Inference for Psychological Research (pdf)
  2. Rouder (2014) — Optional Stopping: No Problem for Bayesians (pdf)
  3. Gallistel (2009) — The Importance of Proving the Null (pdf)
  4. Lindley (2000) — The philosophy of statistics (pdf)

A Bayesian perspective on the Reproducibility Project: Psychology

It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude.

–Harold Jeffreys, 1939

Joachim Vandekerckhove (@VandekerckhoveJ) and I have just published a Bayesian reanalysis of the Reproducibility Project: Psychology in PLOS ONE (CLICK HERE). It is open access, so everyone can read it! Boo paywalls! Yay open access! The review process at PLOS ONE was very nice; we had two rounds of reviews that really helped us clarify our explanations of the method and results.

Oh and it got a new title: “A Bayesian perspective on the Reproducibility Project: Psychology.” A little less presumptuous than the old blog’s title. Thanks to the RPP authors sharing all of their data, we research parasites were able to find some interesting stuff. (And thanks Richard Morey (@richarddmorey) for making this great badge)

[Image: the “research parasite” badge made by Richard Morey]

TLDR: One of the main takeaways from the paper is the following: We shouldn’t be too surprised when psychology experiments don’t replicate, given the evidence in the original studies is often unacceptably weak to begin with!

What did we do?

Here is the abstract from the paper:

We revisit the results of the recent Reproducibility Project: Psychology by the Open Science Collaboration. We compute Bayes factors—a quantity that can be used to express comparative evidence for an hypothesis but also for the null hypothesis—for a large subset (N = 72) of the original papers and their corresponding replication attempts. In our computation, we take into account the likely scenario that publication bias had distorted the originally published results. Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null. In all cases where the original paper provided strong evidence but the replication did not (15%), the sample size in the replication was smaller than the original. Where the replication provided strong evidence but the original did not (10%), the replication sample size was larger. We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature. We further conclude that traditional sample sizes are insufficient and that a more widespread adoption of Bayesian methods is desirable.

In the paper we try to answer four questions: 1) How much evidence is there in the original studies? 2) If we account for the possibility of publication bias, how much evidence is left in the original studies? 3) How much evidence is there in the replication studies? 4) How consistent is the evidence between (bias-corrected) original studies and replication studies?

We implement a very neat technique called Bayesian model averaging to account for publication bias in the original studies. The method is fairly technical, so I’ve put the topic in the Understanding Bayes queue (probably the next post in the series). The short version is that each Bayes factor consists of eight likelihood functions that get weighted based on the potential bias in the original result. There are details in the paper, and much more technical detail in this paper (Guan and Vandekerckhove, 2015). Since the replication studies would be published regardless of outcome, and were almost certainly free from publication bias, we can calculate regular (bias free) Bayes factors for them.

Results

There are only 8 studies where both the bias-mitigated original Bayes factors and the replication Bayes factors are above 10 (highlighted with the blue hexagon). That is, both experiment attempts provide strong evidence. It may go without saying, but I’ll say it anyway: These are the ideal cases.

(The prior distribution for all Bayes factors is a normal distribution with mean of zero and variance of one. All the code is online HERE if you’d like to see how different priors change the result; our sensitivity analysis didn’t reveal any major dependencies on the exact prior used.)

The majority of studies (46/72) have both bias-mitigated original and replication Bayes factors in the 1/10 < BF < 10 range (highlighted with the red box). These are cases where both study attempts only yielded weak evidence.

[Table: bias-mitigated original and replication Bayes factors for each study, with the blue hexagon and red box highlights described above]

Overall, both attempts for most studies provided only weak evidence. There is a silver/bronze/rusty-metal lining, in that when both study attempts obtain only weak Bayes factors, they are technically providing consistent amounts of evidence. But that’s still bad, because “consistency” just means that we are systematically gathering weak evidence!

According to our analysis, no studies provided strong evidence favoring the null hypothesis in either the original or the replication.

It is interesting to consider the cases where one study attempt found strong evidence but another did not. I’ve highlighted these cases in blue in the table below. What can explain this?

[Table: the same Bayes factor table, with the cases where one attempt found strong evidence and the other did not highlighted in blue]

One might be tempted to manufacture reasons that explain this pattern of results, but before you do that take a look at the figure below. We made this figure to highlight one common aspect of all study attempts that find weak evidence in one attempt and strong evidence in another: Differences in sample size. In all cases where the replication found strong evidence and the original study did not, the replication attempt had the larger sample size. Likewise, whenever the original study found strong evidence and the replication did not, the original study had a larger sample size.

[Figure 2]

Figure 2. Evidence resulting from replicated studies plotted against evidence resulting from the original publications. For the original publications, evidence for the alternative hypothesis was calculated taking into account the possibility of publication bias. Small crosses indicate cases where neither the replication nor the original gave strong evidence. Circles indicate cases where one or the other gave strong evidence, with the size of each circle proportional to the ratio of the replication sample size to the original sample size (a reference circle appears in the lower right). The area labeled ‘replication uninformative’ contains cases where the original provided strong evidence but the replication did not, and the area labeled ‘original uninformative’ contains cases where the reverse was true. Two studies that fell beyond the limits of the figure in the top right area (i.e., that yielded extremely large Bayes factors both times) and two that fell above the top left area (i.e., large Bayes factors in the replication only) are not shown. The effect that relative sample size has on Bayes factor pairs is shown by the systematic size difference of circles going from the bottom right to the top left. All values in this figure can be found in S1 Table.

Abridged conclusion (read the paper for more! More what? Nuance, of course. Bayesians are known for their nuance…)

Even when taken at face value, the original studies frequently provided only weak evidence when analyzed using Bayes factors (i.e., BF < 10), and as you’d expect this already small amount of evidence shrinks even more when you take into account the possibility of publication bias. This has a few nasty implications. As we say in the paper,

In the likely event that [the original] observed effect sizes were inflated … the sample size recommendations from prospective power analysis will have been underestimates, and thus replication studies will tend to find mostly weak evidence as well.

According to our analysis, in which a whopping 57 out of 72 replications had 1/10 < BF < 10, this appears to have been the case.

We also should be wary of claims about hidden moderators. We put it like this in the paper,

The apparent discrepancy between the original set of results and the outcome of the Reproducibility Project can be adequately explained by the combination of deleterious publication practices and weak standards of evidence, without recourse to hypothetical hidden moderators.

Of course, we are not saying that hidden moderators could not have had an influence on the results of the RPP. The statement is merely that we can explain the results reasonably well without necessarily bringing hidden moderators into the discussion. As Laplace would say: We have no need of that hypothesis.

So to sum up,

From a Bayesian reanalysis of the Reproducibility Project: Psychology, we conclude that one reason many published effects fail to replicate appears to be that the evidence for their existence was unacceptably weak in the first place.

With regard to interpretation of results — I will include the same disclaimer here that we provide in the paper:

It is important to keep in mind, however, that the Bayes factor as a measure of evidence must always be interpreted in the light of the substantive issue at hand: For extraordinary claims, we may reasonably require more evidence, while for certain situations—when data collection is very hard or the stakes are low—we may satisfy ourselves with smaller amounts of evidence. For our purposes, we will only consider Bayes factors of 10 or more as evidential—a value that would take an uninvested reader from equipoise to a 91% confidence level. Note that the Bayes factor represents the evidence from the sample; other readers can take these Bayes factors and combine them with their own personal prior odds to come to their own conclusions.

All of the results are tabulated in the supplementary materials (HERE) and the code is on github (CODE HERE).


 

More disclaimers, code, and differences from the old reanalysis

Disclaimer:

All of the results are tabulated in a table in the supplementary information (link), and MATLAB code to reproduce the results and figures is provided online (CODE HERE). When interpreting these results, we use a Bayes factor threshold of 10 to represent strong evidence. If you would like to see how the results change when using a different threshold, all you have to do is change the code in line 118 of the ‘bbc_main.m’ file to whatever thresholds you prefer.

#######

Important note: The function to calculate the mitigated Bayes factors is a prototype and is not robust to misuse. You should not use it unless you know what you are doing!

#######

A few differences between this paper and an old reanalysis:

A few months back I posted a Bayesian reanalysis of the Reproducibility Project: Psychology, in which I calculated replication Bayes factors for the RPP studies. This analysis took the posterior distribution from the original studies as the prior distribution in the replication studies to calculate the Bayes factor. So in that calculation, the hypotheses being compared are: H_0 “There is no effect” vs. H_A “The effect is close to that found by the original study.” It also did not take into account publication bias.

This is important: The published reanalysis is very different from the one in the first blog post.

Since the posterior distributions from the original studies were usually centered on quite large effects, the replication Bayes factors could fall in a wide range of values. If a replication found a moderately large effect, comparable to the original, then the Bayes factor would very largely favor H_A. If the replication found a small-to-zero effect (or an effect in the opposite direction), the Bayes factor would very largely favor H_0. If the replication found an effect in the middle of the two hypotheses, then the Bayes factor would be closer to 1, meaning the data fit both hypotheses equally badly. This last case happened when the replications found effects in the same direction as the original studies but of smaller magnitude.

These three types of outcomes happened with roughly equal frequency; there were lots of strong replications (big BF favoring H_A), lots of strong failures to replicate (BF favoring H_0), and lots of ambiguous results (BF around 1).

The results in this new reanalysis are not as extreme because the prior distribution for H_A is centered on zero, which means it makes more similar predictions to H_0 than the old priors. Whereas roughly 20% of the studies in the first reanalysis were strongly in favor of H_0 (BF>10), that did not happen a single time in the new reanalysis. This new analysis also includes the possibility of a biased publication process, which can have a large effect on the results.

We use a different prior so we get different results. Hence the Jeffreys quote at the top of the page.