The Etz-Files

2015-08-30T11:05:55-05:00

Really interesting work! I’m a philosopher of science, so it’s fascinating to me to see how different statistical analyses of the replication results are getting (somewhat) different overall conclusions.

“If you have a better idea for displaying this please leave a comment.”

Density plots? http://www.statmethods.net/graphs/density.html; http://docs.ggplot2.org/0.9.3.1/geom_density.html

2015-08-30T11:31:26-05:00

Great analysis!
This is exactly what I hoped would happen with the data and code.

I would have been surprised if results would have diverged strongly using Bayes factors.

One thing I do not understand is how one can claim a degree of evidence without knowing what was actually predicted and measured. What scientists using frequentist analyses are supposed to do (I’m not saying that they in fact do so) is re-establish the epistemic link between the statistical hypothesis and the research question in terms of the predicted observational constraint, in order to evaluate credibility of the claim.

So, if I predict it is going to rain tomorrow at 12:00 based on information I have today and I am off by 1-2 hours, this will not be considered strong evidence for the credibility of my weather model. If I made this prediction 1 year ago and I am off by the same amount, that would likely be considered a corroboration of my model of the weather.

How can I distinguish between these two situations using Bayes Factors?

Best,
Fred

2015-08-30T12:55:44-05:00

Hey Fred, thanks for the comment and thanks for your work on the RPP 😀

I think you are asking a really good question. Here’s my take: Bayes factors are interesting and relevant only insofar as we judge them as good representatives of our theories. Since these BFs are based on converted correlation coefficients, they are really only approximate Bayes factors. If there is a lot of information in the model not captured by the standardized effect size, then this approximation may not hold (where r is not an approximately sufficient statistic; complex factorial designs, for example). What I don’t think you’d see is a whole lot of cases where a strong success changes to a strong failure (and vice versa), but I wouldn’t rule that out if the BF approximation is essentially nil and I essentially compared irrelevant models. This is really just a rough first pass at it, and exact Bayes factors would likely change many of the particular numerical values. Again, how much they change depends on how good of an approximation these measures are. I’d be happy for someone to re-analyze this to check that, and I’m confident there would be some cases where these BFs are overshooting and cases where they are undershooting.

In other words, the conversion to a correlation limits the models’ predictions to a standardized scale. Some researchers understandably don’t like evaluating model prediction (including BFs) on standardized scales (Cohen’s d as well) because experimental design plays heavily in the calculation of the metric. If this conversion neuters the models’ connections to their respective theory then these Bayes factors aren’t so good.

In terms of distinguishing between your two scenarios, let me think about it and get back to you 🙂

2015-08-31T09:55:23-05:00

One additional method for improving predictions here might be to consider reliability estimates for variables and to correct the effect size for attenuation due to measurement error by adapting Spearman’s original formulae:

Spearman (1904), “The Proof and Measurement of Association between Two Things”

This could show that two studies produced virtually identical results despite different point estimates.

For example, if Study 1 found an rxy of .35 with reliabilities of rxx = .9 and ryy = .9, then the “true” effect corrected for measurement error is rx’y’ = rxy/(rxx * ryy)^.5 = .389. This is nearly the same as a replication finding an rxy of .27 with reliabilities of rxx = .7 and ryy = .7, where rx’y’ = rxy/(rxx * ryy)^.5 = .386.

I doubt this would have a huge effect on predictions (and it shouldn’t if measures have high reliability), but it might be worth considering.
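The correction for attenuation described above is easy to check numerically. Here is a minimal Python sketch, assuming the simple Spearman form (observed correlation divided by the square root of the product of the two reliabilities); the function name `disattenuate` is made up for illustration.

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction: observed r divided by sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Values taken from the example in the comment above.
study1 = disattenuate(0.35, 0.9, 0.9)
replication = disattenuate(0.27, 0.7, 0.7)

print(f"Study 1 corrected r:     {study1:.3f}")
print(f"Replication corrected r: {replication:.3f}")
print(f"Difference:              {abs(study1 - replication):.3f}")
```

The two corrected values differ only in the third decimal place, even though the observed correlations differ by .08. The sketch treats the reliabilities as known constants, which they are not in practice.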

2015-08-31T12:26:25-05:00

Thanks Matthew, this is a really interesting idea.

“and it shouldn’t if measures have high reliability”

I think you would see a bit of variability in the reliability of measures in a dataset as diverse as this, but it would be interesting to look at how this changes the predictions for a given effect.

2015-08-31T15:12:27-05:00

Remember that the reliability figures are point estimates and the attenuation corrections make assumptions that may not hold (essentially the corrections tend to work well only if reliability is high and from a large sample – and that assumes that the reliability isn’t different in the study that generated the data and the study that generated the reliability estimate).

One is better off comparing estimates on the original (unstandardised) scale, I think.

2015-08-30T19:50:50-05:00

Reblogged this on Neuroconscience and commented:
Fantastic post by Alexander Etz (@AlxEtz), which uses a Bayes Factor approach to summarise the results of the reproducibility project. Not only a great way to get a handle on those data but also a great introduction to Bayes Factors in general!

2015-08-30T22:05:18-05:00

Great post (as usual)! I very much like that someone else is fighting back against dichotomiousness (dichotomyness? dichotomousity? The Reign of Dichotomy?) 😛 My main beef with the trendy replication movement right now has always been the inevitable misinterpretation of what replications mean or are supposed to mean. It’s great that there *are* replication attempts but in equal measure we need better education about what information they provide. Anyway a few thoughts:

1. Even though it is dichotomous, I think your histogram of BF categories is fine – but you asked so here is another way to represent those replication BFs. In fact, it reveals some interesting features about BFs that I also discussed in my work-in-progress bootstrapping paper (I will eventually write a new draft. It’s just not high on my list of priorities right now). Here I replotted the BFs from your table as a histogram (open circles) and then added a smoothed histogram (using a kernel density smoother in Matlab). Moreover, I used the logarithm to express the BFs (no other way to represent this – your categorical histogram also does that in a way). The dashed and dotted vertical lines show BFs that would classify as “strong” (BF>10 or <1/10) or “very strong” (BF>100 or <1/100), respectively.

What this shows is that actually most BFs will cluster relatively close to 1 (that is, zero on this plot) even if they provide reasonably strong evidence. Only a handful of results fall way off into excessive values. I discussed that same issue in my manuscript. With BFs (or p-values too to be honest) you can get some very extreme values but should you interpret them as being particularly extreme? On repeated measurement (I know this is a very frequentist concept :P) they are very inconsistent. Perhaps one could therefore argue that categorical labels are useful after all – one could say that it doesn’t matter whether a result in your “very strong” category has BF=300,000 or 300, as the interpretation would probably be the same. But I am not sure this is right as this would bring back dichotomous thinking. How likely it would be for a replication to cross the category boundary is probably what matters here pragmatically. I don’t know this yet.

2. You argue very compellingly why it is flawed to make inferences on replications based on whether CIs include the original effect or not. But wouldn’t a sensible way be to calculate the conjoint CI of the original and replication data? I guess this works on the assumption that the findings are homogeneous so conceptual replications or all sorts of minor changes to protocols would arguably not fit. But I assume clever statisticians have already come up with a solution to this too :P. Either way, a conjoint CI would tell you the combined effect size and its uncertainty thus allowing you to make an inference about what the effect actually is. If the replication is very precise but the original effect wasn’t then it will be heavily weighted towards the replication. This seems to be essentially Bayesian to me without being formal about it?

3. Okay, this one is a bit tongue in cheek but do you realise that you are essentially talking like a frequentist when you calculate the success rates based on the different BF classifications at the end of your post? It’s just as I have long suspected that Bayesians really can’t shake frequentist thinking either 😉

Seriously though, while I think hypothesis testing is important, isn’t it more useful to move to estimation when analysing replications? I can see why original studies want to test specific hypotheses but once you have accumulated replications you surely must start to be more interested in what the effect size estimate actually is?

Anyway, I’ll be happy to discuss all this when I’m back from Twitterlessness… 😉

2015-08-30T22:59:22-05:00

Sam, we are antichotomous! Thanks for commenting, I’m eager to have you back on twitter. And thanks for sharing this graph.

1. A tricky part when interpreting Bayes factors is that they don’t scale linearly (since they are multiplicative ratios). So while, yes, a BF of 300k is quite a lot larger than 300, it’s hard to really grasp the consequences of changes of that magnitude. But since they are unitless ratios, we can interpret them as such; we can always say that the first is 1000 times stronger evidence. How compelling these numbers are to someone depends on their personal prior for a specific effect or theory. That’s what I tried to allude to by discussing how skeptics and proponents would react to various results (without invoking actual numbers for their pesky personal priors). And the fact that many are (relatively) near 1 is a function of this dataset. No guarantee this would hold in the Journal of Vision (for example). Or for other specifications of the models, for that matter.

2. I think combining CIs is tricky here. What is the long-run interpretation? The only reason the second CI was calculated was because we wanted to verify the first one. There may be ways to model (simulate?) this dependency between them, but I don’t know how one would. Maybe some variant of a random-effect model or something.

As you say, the goal is not always to get a combined estimate anyways. If I doubt the validity of the first result due to biased reporting processes, why would I want to combine it with a pre-registered (minimally biased) estimate? Also, some of the original CIs were actually pretty narrow, and some replications had smaller N than the original studies, so there is no guarantee that you get a big weight towards your replication. You could build a Bayesian model of this by giving different probabilities to the different models and then mixing, but I don’t know how a frequentist could do it and still be a frequentist. Remember, probabilities strictly can’t be assigned to models in that framework.

“This seems to be essentially Bayesian to me without being formal about it?” Sure, let’s just roll with random “sensible” heuristics. Not like that kind of thinking got us into this mess 😛 Principles, Sam! Principles!

3. Finally, I can see why you might think tabulating replication success and failure might look like I’m a dirty frequentist (joking!!). But I’m conditioning on observables (data) in my tabulation, not hypothetical parameters. These rates are descriptive, not prescriptive. 😉

Once you’re confident there is something to estimate, feel free to estimate it.
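Point 1 above says that how compelling a Bayes factor is depends on the reader's personal prior. Here is a rough Python sketch of that idea, using the identity posterior odds = Bayes factor × prior odds. The three prior probabilities are invented for illustration and do not come from the post.

```python
import math

def posterior_probability(bayes_factor, prior_probability):
    """Update a prior probability using posterior odds = BF * prior odds."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical personal priors for the original effect (illustration only).
readers = {"skeptic": 0.001, "undecided": 0.5, "proponent": 0.9}

for name, prior in readers.items():
    for bf in (300.0, 300_000.0):
        post = posterior_probability(bf, prior)
        print(f"{name:10s} prior={prior:<6} BF={bf:>9.0f} posterior={post:.4f}")

# The two Bayes factors differ by a factor of 1000, i.e. 3 units on a log10 scale.
log10_gap = math.log10(300_000.0 / 300.0)
print(f"log10 gap between the two BFs: {log10_gap:.1f}")
```

Under these made-up priors, the jump from 300 to 300,000 matters a great deal to the skeptic and hardly at all to the proponent, whose posterior is already close to 1 at BF = 300.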

2015-08-30T23:48:50-05:00

Good answers thanks! Regarding the conjoint interval, I still think it would probably tell you something. If, as you say, the interval of the original study is narrower than that of the replication then surely it should carry more weight? I do get your point about potential bias in the original though (of course bias could also exist in the replication but this is a story for another day… ;). In this case it indeed makes more sense to see how consistent the replication effect(s) are with the original ones.
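For anyone who wants to see what such a conjoint interval might look like, here is a minimal Python sketch. It assumes a fixed-effect (common-effect) pooling of two correlations on the Fisher z scale with weights n - 3. That is one standard way to combine correlations, not necessarily what either commenter has in mind, and it ignores the reporting-bias concern raised above. The inputs are the row for study 97 in the table from the post.

```python
import math

def pooled_correlation(r1, n1, r2, n2, z_crit=1.96):
    """Fixed-effect pooling of two correlations via Fisher's z transform.

    Returns (pooled r, lower CI bound, upper CI bound) on the r scale.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    w1, w2 = n1 - 3, n2 - 3            # inverse variances of Fisher z
    z_pooled = (w1 * z1 + w2 * z2) / (w1 + w2)
    se = 1.0 / math.sqrt(w1 + w2)
    lower = math.tanh(z_pooled - z_crit * se)
    upper = math.tanh(z_pooled + z_crit * se)
    return math.tanh(z_pooled), lower, upper

# Study 97 from the table: N_orig = 73, r_orig = 0.38, N_rep = 1486, r_rep = -0.04.
r_pooled, ci_low, ci_high = pooled_correlation(0.38, 73, -0.04, 1486)
print(f"pooled r = {r_pooled:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Because the replication sample is about twenty times larger, the pooled estimate sits very close to the replication value. When the original study is the more precise one, the weighting goes the other way.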

“Principles, Sam! Principles!”
I guess in the end I’ll just always be a pragmatist 😉

“I’m eager to have you back on twitter”
Considering that I’m commenting here I could theoretically be on twitter right now but I’m trying not to be too over the top in my lawlessness (yes, the attentive reader will have induced correctly that your blog is also banned here, as is mine it would appear. A billion people with no access to Bayesian inference or my inane ramblings. Forget social media – this is the true tragedy here! 😉)

2015-08-31T00:08:10-05:00

‘deduced’ not ‘induced’. Been thinking too much about visual illusions lately… 😛

2015-09-01T10:38:05-05:00

I just realised I was wrong: your (or my) blog isn’t banned here after all. So all is well I guess… 😛

2015-08-30T22:06:35-05:00

Ooops, looks like my link to the graph didn’t work (too long since I used html). Here is a direct link:

2015-08-31T05:05:29-05:00

great post!
But why the assumption that the relevant comparison is to a ‘null’ model of exactly zero effect? Isn’t the relevant comparison to ‘an effect of the smallest size to have any practical or theoretical consequences’? In some contexts this might be anything different from zero, but more often than not it would need to be of a certain size to have any practical or theoretical significance….
Of course, taking this into account will only make the original studies look even worse.

2015-08-31T12:37:33-05:00

I agree with you, Dimiter. In some cases the researchers are really interested in comparing to clinically/theoretically uninteresting effect sizes. In most cases this would shrink a given Bayes factor, since the null can now account for a wider range of observables. By how much depends on the context of the specific effect and how large of an effect size is considered relevant, of course.

I don’t have a problem with non-point nulls in general, and if an analyst thinks it’s reasonable and theoretically motivated then they should feel free to use one. In practice, for this dataset, it would take a considerable amount of work to implement this on a case by case basis but for a smaller “reproducibility project” (with just a few studies) it wouldn’t be too hard. As I say in the post, “there is no single Bayesian answer to any question”, precisely because one can always find other reasonable ways to formulate the problem. This is one such way.

2015-08-31T10:46:45-05:00

[…] The Bayesian Reproducibility Project […]

2015-08-31T22:14:28-05:00

I just wanted to say that I loved this post.
The first thing I thought after reading the reproducibility project was “why on Earth are they not employing Bayesian methods for this question?”.

I care little about frequentist approaches to this project. To be frank, it seemed like the authors were attempting to solve a largely frequentist problem using more frequentist problems. As though they were digging around in the frequentist toolbox for answers to their problem, and wound up hammering in a nail with a screwdriver because “close enough”.

Although their estimates were not far off of yours, what I *really* want to see is how parameter estimates have changed in light of new data, not some dichotomous ‘did it work again’ decision. Bayes factors get us closer if nothing else.

Thank you for this post!

2015-08-31T22:58:27-05:00

Stephen, thank you for the very kind words. I think I just put into words what everyone was thinking, namely, “Surely it can’t be so black and white?” And I’m sure we’ll be seeing many more interpretations of these results in the coming weeks/months/years.

In my opinion, as I said in a reply to Sam above, I think you should start estimating things once you’re confident there is something there to estimate. If someone believes the null is always false, and we should always estimate everything, I understand that. Essentially what that is saying is they give the null model a probability of approximately zero. Reasonable enough, even if I wouldn’t do that. (These are personal probabilities, after all.)

If, however, you don’t have complete disregard for the null, then I think the best estimates we could get here are from averaging the different model estimates based on their posterior probability. Jeff Rouder wrote a good piece on that a while back: http://jeffrouder.blogspot.com/2015/03/estimating-effect-sizes-requires-some.html
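A toy version of averaging over models might look like the Python sketch below. It assumes just two models, a point null (effect exactly 0) and an alternative with its own effect estimate, and weights each estimate by its posterior model probability. The function names, the 50/50 prior, and the example numbers are all made up for illustration; this is not the procedure from the linked piece.

```python
def posterior_prob_h1(bf10, prior_h1=0.5):
    """Posterior probability of the alternative, given BF10 and a prior probability."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

def model_averaged_effect(bf10, effect_under_h1, prior_h1=0.5):
    """Average the effect estimates of H0 (exactly 0) and H1, weighted by posterior probability."""
    p_h1 = posterior_prob_h1(bf10, prior_h1)
    return p_h1 * effect_under_h1 + (1.0 - p_h1) * 0.0

# Hypothetical inputs: the same estimate under H1, with different amounts of evidence.
for bf10 in (0.1, 1.0, 10.0, 100.0):
    averaged = model_averaged_effect(bf10, effect_under_h1=0.30)
    print(f"BF10={bf10:>6}: model-averaged effect = {averaged:.3f}")
```

The weaker the evidence for the alternative, the more the averaged estimate is shrunk toward zero. Setting `prior_h1` very close to 1 recovers the "always estimate" position described above.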

I’m curious to hear how people would want to go about estimating these effects. Would you try to extract a bias-reduced estimate from the original studies and then do a precision-weighted average, or give extra weight to the pre-registered (low-bias) replication attempts, or disregard the original estimates from studies with strong failures to replicate, etc etc etc? Have you given it much thought? It seems complicated to me, and I don’t know exactly how I would do it, so I’d be really interested to hear what you think.

2015-09-01T00:36:44-05:00

I’m still not sure about this. Even if you don’t believe in the effect’s existence, couldn’t you still estimate the parameter to see how far away from zero it is? Only if your parameter estimate after replication is sufficiently beyond the null range of no interest would you then infer that effect was meaningful. I suppose though that this isn’t all that far from the non-zero null models you discussed above so you could do BFs on that too.

I agree that the bias-reduction aspect is a complex one but it seems important. I would assume that if the posterior of the original study incorporated the bias, a replication result going in the same direction but with a weaker effect size estimate would actually be stronger evidence for the effect’s existence than it would be using the “raw” posterior?

2015-09-01T00:49:55-05:00

I think looking at how far away parameter estimates are from null values is just a crude way to do a hypothesis test. If you want to do that, then do a test with principles behind it! It’s just as easy to implement a Bayes factor, but BFs actually follow from the laws of probability.

I think incorporating the bias into the posterior of the original study is one way you could do it. Maime Guan and Joachim Vandekerckhove just wrote a paper about mitigating bias in effect sizes through averaging the estimates from different bias-generating models http://www.cidlab.com/prints/guan2015bayesian.pdf which is one way you might do that. Still some work to be done on that front though.

In general, if you correct for bias by having the replication prior localized around smaller effects, you would indeed have more favorable Bayes factors. The replication BF on the raw posterior can be thought of as highly biased against small effects. Essentially it penalizes you for biasing your original estimates with a bias in this test. The question it asks is: Is the replication effect roughly as large as before or null? If you make that “as large as before” smaller, smaller replication effect sizes will indeed fit better.

2015-09-01T06:35:54-05:00

I think it would be definitely worth looking more into this and/or promoting this more. Of course in some cases the bias-corrected posterior will probably look more or less like no effect anyway… 😉

2015-09-01T14:54:37-05:00

Yes, I imagine we’d see many published estimates shrink by quite a lot.

2015-09-01T06:10:58-05:00

[…] effect, and which turned out to be uninformative. Alex Etz’s conclusions[8] under this approach turned out to be fairly […]

2015-09-01T13:15:52-05:00

Great post. In regard to how to display the replication Bayes Factors, I recommend the empirical cumulative distribution function, with the “weight of the evidence” (common log of the Bayes Factor) as the x axis. Generally speaking, the cumulative distribution is THE way to display raw data. There is no smoothing or averaging. You can immediately see the location, spread, form and range of the distribution and whether it was censored. In the graph I will attempt to attach, made from the BFs in your table, the data were not censored, but I limited the x-axis range to the portion of greatest interest (-3 to 3). I don’t know how to use Matlab’s publish command to get a figure to appear in a comment. Below is the Matlab code that generated my figure:

% Code assumes a Dat variable in Matlab's workspace. Dat is a
% vector of the 95 replication BFs in Etz's table.
% cdfplot is part of the Statistics Toolbox.
figure
H = cdfplot(log10(Dat));
set(H,'LineWidth',2)
xlabel('Weight of Replication Evidence [log_1_0(BF)]')
ylabel('Cumulative Fraction of Replications')
hold on
plot([0 0],ylim,'r--','LineWidth',2) % dashed red line at BF = 1

title('Cumulative Distribution','FontSize',14)
xlim([-3 3])
% Category boundaries at BF = 1/3, 1/10, 1/100 and 3, 10, 100
plot([log10(1/3) log10(1/3)],ylim,'k:',[-1 -1],ylim,'k-.',[-2 -2],ylim,'k-')
plot([log10(3) log10(3)],ylim,'k:',[1 1],ylim,'k-.',[2 2],ylim,'k-')
text(-2.9,.5,'Decisive')
text(-1.9,.5,'Strong')
text(-.95,.5,'Good')
text(-.3,.5,'Weak')
text(2.1,.5,'Decisive')
text(1.1,.5,'Strong')
text(.5,.5,'Good')

text(.1,.9,'Favors Original','FontSize',14)
text(-1.1,.9,'Favors Null','FontSize',14)

Published with MATLAB® R2014b
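For readers without Matlab, the same quantity can be computed in a few lines of plain Python. This is only a sketch of the idea: it computes the empirical cumulative distribution of log10(BF) without drawing anything, and it uses a small subset of eight Bayes factors copied from the table in the post, not all 95.

```python
import math

def ecdf(values):
    """Return (x, cumulative fraction) pairs for the empirical CDF of values."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# A subset of replication BFs from the table (illustration only, not the full data).
bfs = [3.84e-06, 1.35e-03, 3.41e-01, 1.24, 3.01, 11.7, 110.0, 1.34e32]

# "Weight of evidence" is the common log of the Bayes factor.
weights = [math.log10(bf) for bf in bfs]
points = ecdf(weights)

for x, frac in points:
    print(f"log10(BF) = {x:8.2f}   cumulative fraction = {frac:.3f}")

# Fraction of this subset that favors the null (BF < 1, i.e. log10(BF) < 0).
frac_favoring_null = sum(w < 0 for w in weights) / len(weights)
print(f"fraction favoring null: {frac_favoring_null:.3f}")
```

Plotting `points` as a step function, with vertical lines at the category boundaries, would give a figure of the kind described above.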

2015-09-01T22:49:29-05:00

I agree this is quite a nice way to represent this. I created the plot here:

2015-09-01T22:53:49-05:00

Wow, this is a really nice way of looking at it. Thanks Randy!

And thanks for posting it Sam! I don’t have MATLAB. v_v

2015-09-02T00:21:31-05:00

On this note, one fine day I will perhaps reinstall R 😉

2015-09-02T10:13:33-05:00

[…] some debate about the methodology of the study — see, for example, the excellent post by Alexander Etz who suggests a Bayesian approach instead of classifying each replication attempt into a success vs. […]

2015-09-04T05:21:18-05:00

When an original study and a replication (designed to match in power) have wildly different confidence interval widths, I’m interested. If this keeps happening, it tells us we’re bad at estimating confidence — almost certainly being over-certain in published results. Am I correct in thinking Bayes factors are not interested in diagnosing this specifically?

For example, if the original had a confident effect, say CI [4.9, 5.1], and the replication says an uninformative [-100, 100], that gives a Bayes factor near 1. A 50 or 1/50 Bayes factor implies the replication gave a good amount of information, but the converse is not true, i.e. 1 doesn’t imply the replication was uninformative, if I understand correctly. I haven’t solved for this, but I believe there’s a locus of (interval width, interval center) that gives Bayes factor = 1.

In evaluating the replication, there are two dimensions, agreement between the studies’ statements and similarity in strength of statements. (Not orthogonal when stated that way, but there are two dimensions.) Bayes factor is one-dimensional by design, so it doesn’t expect to distinguish the two independently. Ah, rereading you noted the Bayes factor “disentangles the two types of results that traditional significance tests struggle with: a result that actually favors the null model vs a result that is simply insensitive”. It does distinguish those two cases in a useful way, but it doesn’t distinguish between “simply insensitive” and “says something, but something in between the original and the null model”, correct? Which I’d like to distinguish, for motivation above.

You know, I quite like the paper’s scatterplot of the originals’ and replications’ effect sizes, I would just want to draw the intervals (in both directions) on there too.

(I’m a bystander in experimental design and in prob/stats, so please forgive blunders, but thanks for your intriguing write-up.)

2015-09-04T12:43:23-05:00

Tangent,

It’s important to remember what question the Bayes factor is trying to answer: Given these two models, how much better does the replication data fit one model or the other? If the data are so variable that their standard error is 1000 times that of the original experiment (in your example, 50 vs .05), the data fit both models very poorly and the BF will be near 1, as you note. Sensitivity is relative.

You say, “A 50 or 1/50 Bayes factor implies the replication gave a good amount of information, but the converse is not true, i.e. 1 doesn’t imply the replication was uninformative”

My answer is that a BF of 1 always means the same thing. It always means the data were uninformative *with respect to the models being compared*. There are an infinite number of ways to achieve a BF near 1, but they all indicate that the data do not clearly favor one of the models. In technical terms, it means that the probability of the data (i.e., the marginal likelihood) under both models is approximately the same. The question of *why* the data were insensitive cannot be answered by the Bayes factor. That’s an experimental design question, not an inferential question. I tried to address this in the first paragraph of the “Try out this method” section.

You say, “[the Bayes factor] doesn’t distinguish between “simply insensitive” and “says something, but something in between the original and the null model””

First, sensitivity is relative to the models in question. Data with a SE of 50 (your insensitive example) are quite sensitive if the models have variability on the order of 5000. When compared to models with variability on the order of .05, they’re insensitive. Bayesian inference does not entail statements of absolutes except these three: Everything is relative, everything is conditional, and all inferences must follow from the laws of probability.

Second, remember, the question is about comparing the relative fit of these two particular models. This does not preclude you from introducing a third intermediate model that you think would predict the data better. In fact, there will almost always be a third model that fits the data better than the two models under consideration. The question is not whether this model exists (it almost certainly does), but whether it was motivated by theory and not crafted in response to the data being tested. You cannot use the data to create a model that then tests the same data, or you’ll use the data twice. Models constructed in response to these replication data would need to be tested on a new batch of data.
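The point that sensitivity is relative to the models can be illustrated with a toy calculation. The Python sketch below is not the replication Bayes factor used in the post. It assumes a simple normal approximation: the replication estimate is normally distributed with a known standard error, the null fixes the effect at 0, and the alternative places a normal prior on the effect, centered on the original estimate with the original standard error. The standard errors of roughly 51 and .051 are read off the example intervals [-100, 100] and [4.9, 5.1]; the second scenario, with models on the order of 5000, is invented to match the discussion.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of Normal(mean, sd^2) at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)

def log10_replication_bf(rep_est, rep_se, orig_est, orig_se):
    """log10 BF for H1 (effect ~ Normal(orig_est, orig_se^2)) versus H0 (effect = 0).

    Under H1 the marginal distribution of the replication estimate is
    Normal(orig_est, rep_se^2 + orig_se^2); under H0 it is Normal(0, rep_se^2).
    """
    log_m1 = log_normal_pdf(rep_est, orig_est, math.sqrt(rep_se**2 + orig_se**2))
    log_m0 = log_normal_pdf(rep_est, 0.0, rep_se)
    return (log_m1 - log_m0) / math.log(10)

# Scenario from the discussion: a precise original (SE about .051) and a
# very noisy replication centered on 0 (SE about 51).
insensitive = log10_replication_bf(rep_est=0.0, rep_se=51.0, orig_est=5.0, orig_se=0.051)

# Same noisy replication, but the competing models now differ on the order of 5000.
sensitive = log10_replication_bf(rep_est=0.0, rep_se=51.0, orig_est=5000.0, orig_se=50.0)

print(f"log10 BF, models 5 apart:    {insensitive:.4f}")
print(f"log10 BF, models 5000 apart: {sensitive:.1f}")
```

With the same replication data, the first comparison gives a log10 Bayes factor of essentially zero (BF near 1), while the second is overwhelmingly in favor of the null. The calculation works in log space so that the second case does not underflow.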

2015-09-08T15:37:18-05:00

[…] The Bayesian Reproducibility Project […]

2016-02-26T16:15:21-06:00

[…] perspective on the Reproducibility Project: Psychology.” A little less presumptuous than the old blog’s title. Thanks to the RPP authors sharing all of their data, we research parasites were able to […]

2016-03-02T16:34:55-06:00

[…] of the Reproducibility Project, Alexander Etz produced a great Bayesian reanalysis of the data from that project (possible because it is all open access, via the Open Science […]

2016-03-06T08:21:25-06:00

[…] of the Reproducibility Undertaking, Alexander Etz produced a fantastic Bayesian reanalysis of the info from that undertaking (potential as a result of it’s all open entry, by way of […]

2016-03-20T09:20:41-05:00

[…] confidence interval of the replication. Both arguments can lead to some fairly peculiar results. An early criticism of the initial Reproducibility Project paper suggested a Bayesian approach to testing reproducibility but that had its own […]

2016-03-28T05:53:05-05:00

[…] 1. Learn about Bayes, because this and this. […]

	Study# N_orig N_rep r_orig r_rep bfRep rep_pval bin# code
	110 278 142 0.55 0.09 3.84E-06 0.277 1 Very strong
	97 73 1486 0.38 -0.04 1.35E-03 0.154 2 Strong
	8 37 31 0.56 -0.11 1.63E-02 0.540 2 Strong
	4 190 268 0.23 -0.01 2.97E-02 0.920 2 Strong
	65 41 131 0.43 0.01 3.06E-02 0.893 2 Strong
	93 83 68 0.32 -0.14 3.12E-02 0.265 2 Strong
	81 90 137 0.27 -0.10 3.24E-02 0.234 2 Strong
	151 41 124 0.40 0.00 4.52E-02 0.975 2 Strong
	7 99 14 0.72 0.13 5.04E-02 0.314 2 Strong
	148 194 259 0.19 -0.03 5.24E-02 0.628 2 Strong
	106 34 45 0.38 -0.22 6.75E-02 0.132 2 Strong
	48 92 192 0.23 -0.05 7.03E-02 0.469 2 Strong
	56 99 38 0.38 -0.04 7.54E-02 0.796 2 Strong
	49 34 86 0.38 -0.03 7.96E-02 0.778 2 Strong
	118 111 158 0.21 -0.05 8.51E-02 0.539 2 Strong
	124 34 68 0.38 -0.03 9.07E-02 0.778 2 Strong
	61 108 220 0.22 0.00 9.56E-02 0.944 2 Strong
	3 24 31 0.42 -0.22 1.03E-01 0.229 3 Moderate
	165 56 51 0.28 -0.18 1.05E-01 0.210 3 Moderate
	149 194 314 0.19 0.02 1.15E-01 0.746 3 Moderate
	87 51 47 0.40 0.01 1.22E-01 0.929 3 Moderate
	155 50 69 0.31 -0.03 1.30E-01 0.778 3 Moderate
	104 236 1146 0.13 0.02 1.59E-01 0.534 3 Moderate
	115 31 8 0.50 -0.45 1.67E-01 0.192 3 Moderate
	72 257 247 0.21 0.04 1.68E-01 0.485 3 Moderate
	68 116 222 0.19 0.00 1.69E-01 0.964 3 Moderate
	64 76 65 0.43 0.11 1.76E-01 0.426 3 Moderate
	136 28 56 0.50 0.10 1.76E-01 0.445 3 Moderate
	129 26 64 0.37 0.02 1.91E-01 0.888 3 Moderate
	39 68 153 0.37 0.10 2.23E-01 0.211 3 Moderate
	20 94 106 0.22 0.02 2.54E-01 0.842 3 Moderate
	53 31 73 0.38 0.08 2.71E-01 0.513 3 Moderate
	153 7 7 0.86 0.12 2.87E-01 0.758 3 Moderate
	58 182 278 0.17 0.04 3.01E-01 0.540 3 Moderate
	150 13 18 0.72 0.21 3.04E-01 0.380 3 Moderate
	140 81 122 0.23 0.04 3.06E-01 0.787 3 Moderate
	63 68 145 0.27 0.07 3.40E-01 0.374 4 Insensitive
	71 373 175 0.22 0.07 3.41E-01 0.332 4 Insensitive
	1 13 28 0.59 0.15 3.49E-01 0.434 4 Insensitive
	5 31 47 0.46 0.13 3.57E-01 0.356 4 Insensitive
	28 31 90 0.34 0.10 4.52E-01 0.327 4 Insensitive
	161 44 44 0.48 0.18 4.56E-01 0.237 4 Insensitive
	2 23 23 0.61 0.23 4.89E-01 0.270 4 Insensitive
	22 93 90 0.22 0.06 5.07E-01 0.717 4 Insensitive
	55 54 68 0.23 0.07 5.33E-01 0.742 4 Insensitive
	154 67 13 0.43 0.11 5.58E-01 0.690 4 Insensitive
	143 108 150 0.17 0.06 5.93E-01 0.678 4 Insensitive
	89 26 26 0.14 0.03 6.81E-01 0.882 4 Insensitive
	167 17 21 0.60 0.25 7.47E-01 0.242 4 Insensitive
	52 131 111 0.21 0.09 8.29E-01 0.322 4 Insensitive
	12 92 232 0.18 0.08 8.97E-01 0.198 4 Insensitive
	43 64 72 0.35 0.16 9.58E-01 0.147 4 Insensitive
	107 84 156 0.22 0.10 9.74E-01 0.189 4 Insensitive
	80 43 67 0.26 0.16 1.24E+00 0.190 5 Insensitive
	86 82 137 0.21 0.12 1.30E+00 0.141 5 Insensitive
	44 67 176 0.35 0.15 1.40E+00 0.045 5 Insensitive
	132 69 41.458 0.25 0.18 1.44E+00 0.254 5 Insensitive
	37 11 17 0.55 0.35 1.59E+00 0.142 5 Insensitive
	26 94 92 0.16 0.14 1.83E+00 0.166 5 Insensitive
	120 28 40 0.38 0.25 1.98E+00 0.053 5 Insensitive
	50 92 103 0.21 0.16 2.22E+00 0.079 5 Insensitive
	146 14 11 0.65 0.50 2.60E+00 0.084 5 Insensitive
	84 52 150 0.50 0.22 2.94E+00 0.008 5 Insensitive
	19 31 19 0.56 0.40 3.01E+00 0.071 6 Moderate
	33 39 39 0.52 0.32 3.20E+00 0.044 6 Moderate
	82 47 41 0.30 0.27 3.21E+00 0.086 6 Moderate
	73 37 120 0.32 0.20 4.55E+00 0.028 6 Moderate
	24 152 48 0.36 0.28 5.32E+00 0.045 6 Moderate
	6 23 31 0.59 0.40 5.89E+00 0.023 6 Moderate
	25 48 63 0.35 0.27 6.65E+00 0.002 6 Moderate
	94 26 59 0.34 0.29 6.73E+00 0.012 6 Moderate
	111 55 116 0.33 0.23 9.22E+00 0.014 6 Moderate
	112 9 9 0.70 0.75 1.17E+01 0.008 7 Strong
	11 21 29 0.67 0.47 1.29E+01 0.008 7 Strong
	133 23 37 0.45 0.42 1.98E+01 0.007 7 Strong
	127 28 25 0.69 0.53 2.40E+01 0.005 7 Strong
	29 7 14 0.74 0.70 3.32E+01 0.002 7 Strong
	32 36 37 0.62 0.48 5.43E+01 0.002 7 Strong
	117 660 660 0.13 0.12 8.57E+01 0.000 7 Strong
	27 31 70 0.38 0.38 1.10E+02 0.001 8 Very strong
	36 20 20 0.71 0.68 1.97E+02 0.000 8 Very strong
	17 76 72.4 0.30 0.43 7.21E+02 0.000 8 Very strong
	15 94 241 0.20 0.25 8.66E+02 0.000 8 Very strong
	116 172 139 0.29 0.32 1.30E+03 0.000 8 Very strong
	114 30 30 0.57 0.65 1.39E+03 0.000 8 Very strong
	158 38 93 0.37 0.41 2.35E+03 0.000 8 Very strong
	145 76 36 0.77 0.65 5.93E+03 0.000 8 Very strong
	13 68 68 0.52 0.52 2.89E+04 0.000 8 Very strong
	122 7 16 0.72 0.92 5.38E+04 0.000 8 Very strong
	10 28 29 0.70 0.78 1.60E+05 0.000 8 Very strong
	121 11 24 0.85 0.83 1.88E+05 0.000 8 Very strong
	135 562 3511.1 0.005 0.11 1.19E+07 0.000 8 Very strong
	134 115 234 0.21 0.50 2.20E+12 0.000 8 Very strong
	142 162 174 0.59 0.61 1.58E+17 0.000 8 Very strong
	113 124 175 0.68 0.76 1.34E+32 0.000 8 Very strong


	## from the reproducibility project code here https://osf.io/vdnrb/
	library(httr) #provides GET() and write_disk()
	#download the data file from the OSF into your working directory
	info <- GET('https://osf.io/fgjvw/?action=download', write_disk('rpp_data.csv', overwrite = TRUE))
	MASTER <- read.csv("rpp_data.csv")[1:167, ]
	colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file

	studies<-MASTER$ID[!is.na(MASTER$T_r..O.) & !is.na(MASTER$T_r..R.)] ##to keep track of which studies are which
	studies<-studies[-31]##remove the problem studies (46 and 139)
	studies<-studies[-80]

	orig<-MASTER$T_r..O.[studies] ##read in the original rs that have matching rep rs
	rep<-MASTER$T_r..R.[studies] ##read in the rep rs that have matching original rs

	N.R<-MASTER$T_N_R_for_tables[studies] ##n of replications for analysis
	N.O<-MASTER$T_N_O_for_tables[studies] ##n of original studies for analysis

	p<-MASTER$T_pval_USE..R.[studies] #extract p-values for the studies

	bfRep<- numeric(length=95) #prepare for running replications against original study posterior

	#download the code for the replication functions from https://osf.io/v7nux/ and load the functions into the global environment
	for(i in 1:95){
	bfRep[i]<- repBfR0(nOri=N.O[i],rOri=orig[i],nRep=N.R[i],rRep=rep[i])
	}

	#bfRep #remove the leading hash to display the BFs

	#create bin numbers
	bfstrength<-seq(1,8,1)
	#create labels for the bins
	barlabels<-c("BF<1/100","1/100<BF<1/10","1/10<BF<1/3","1/3<BF<1", "1<BF<3","3<BF<10","10<BF<100","BF>100")
	#create category labels to add to bins
	g<-c("Very strong","Strong","Moderate","Insensitive","Insensitive", "Moderate","Strong","Very strong")
	barlabels<-as.vector(barlabels)


	#assign each BF to its category bin (1-8); cutoffs match the bar labels above
	bf<-numeric(length=95)
	for(i in 1:95){
	if(bfRep[i]<.01){ #BF<1/100
	bf[i]<-bfstrength[1]
	}
	if(bfRep[i]<.1 & bfRep[i]>=.01){ #1/100<BF<1/10
	bf[i]<-bfstrength[2]
	}
	if(bfRep[i]<1/3 & bfRep[i]>=.1){ #1/10<BF<1/3
	bf[i]<-bfstrength[3]
	}
	if(bfRep[i]<1 & bfRep[i]>=1/3){ #1/3<BF<1
	bf[i]<-bfstrength[4]
	}
	if(bfRep[i]>=1 & bfRep[i]<=3){ #1<BF<3 (a BF of exactly 1 also lands here so no study is left unbinned)
	bf[i]<-bfstrength[5]
	}
	if(bfRep[i]>3 & bfRep[i]<=10){ #3<BF<10
	bf[i]<-bfstrength[6]
	}
	if(bfRep[i]>10 & bfRep[i]<=100){ #10<BF<100
	bf[i]<-bfstrength[7]
	}
	if(bfRep[i]>100){ #BF>100
	bf[i]<-bfstrength[8]
	}
	}

	table(bf) #shows counts for each bin

	#plot the bins
	barplot(table(bf),names.arg=barlabels,border="gray80",col="paleturquoise1",
	xlab="Replication Bayes Factor Categories",las=1,
	main="Bayesian Replication Outcomes from the Reproducibility Project: Psychology",
	ylim=c(0,25), sub="BFs > 1 are evidence in favor of the original effect")
	arrows(x0=4.8,x1=-.1,y0=21,y1=21,col="red",lwd=5) #replication strength arrow
	arrows(x0=5,x1=9.8,y0=21,y1=21,col="darkorchid1",lwd=5) #ditto
	text("STRONGER REPLICATION 'FAILURE'",y=22,x=2.5,cex=1.5,col="red") #add text above arrows
	text("STRONGER REPLICATION 'SUCCESS'",y=22,x=7.5,cex=1.5,col="darkorchid1") #ditto
	text(g,x=c(.7,1.9,3.1,4.3,5.5,6.7,7.9,9.1),y=rep(.5,8)) #add verbal labels to the bottom of bins (e.g., very strong)

	k<-tabulate(bf,nbins=8)/length(bf) #proportion of reps in each bin

	tab2<-cbind(studies,N.O,N.R,orig,rep,bfRep,p,bf,g=g[bf]) #create matrix of values; g[bf] gives each study's category label
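
The binning loop above can also be written without a loop. What follows is only a minimal sketch, not the code used for the analysis: `bin_bf` is a hypothetical helper name, the example Bayes factors are a handful of values copied from the results table, and it assumes the cutoffs 1/100, 1/10, 1/3, 1, 3, 10 and 100 given in the bar labels. Note that `cut()` treats each bin as (lower, upper], so a BF sitting exactly on a cutoff may land in a different bin than it does in the loop.

```r
# Sketch: vectorized binning of replication Bayes factors with cut().
# bin_bf() is a hypothetical helper, not part of the original analysis code.
bin_bf <- function(bf_values) {
  # cutoffs taken from the bar labels: 1/100, 1/10, 1/3, 1, 3, 10, 100
  breaks <- c(0, 1/100, 1/10, 1/3, 1, 3, 10, 100, Inf)
  cut(bf_values, breaks = breaks, labels = FALSE) # bins are (lower, upper]
}

# verbal labels for bins 1 to 8, same order as g in the code above
category <- c("Very strong", "Strong", "Moderate", "Insensitive",
              "Insensitive", "Moderate", "Strong", "Very strong")

# a few BFs copied from the results table (studies 22, 80, 19, 112, 27)
bf_example <- c(0.507, 1.24, 3.01, 11.7, 110)
bins <- bin_bf(bf_example)

# one row per study: the BF, its bin number, and its verbal label
data.frame(BF = bf_example, bin = bins, label = category[bins])
```

Indexing `category` by the bin number gives one label per study, which is what the last column of the results table shows.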

The Etz-Files

Data science, statistics, and psychology

The Bayesian Reproducibility Project

A Bayesian metric of reproducibility

Replication Bayes factors

So how did we do?

Strong replication failures and strong successes

Moderate replication failures and moderate successes

Many uninformative “failed” replications

Wrap up

Try out this method!

Acknowledgements and thanks

Notes

Results

R Code

References

36 thoughts on “The Bayesian Reproducibility Project”
