An undergraduate’s experience with replications

A lot of psychologists are in a bit of a tiff right now. I think everyone agrees that replications are important, but there doesn’t seem to be a consensus on how they should be carried out (for many perspectives, see: here, here, here, here, here, here, here). Since Sanjay asked for more perspectives from people who aren’t tenured, I figured I’d write up my experience with replication. Note: I have graduated but am not in graduate school yet, so I am one vulnerable puppy. Luckily my experience was very civil.

During my junior/senior year fellowship, I ran 2 identical direct replications of a psychophysics experiment, and both were disappointing. I wasn’t the first person in the lab to try to replicate it either: adding my “failures” brought the collective total to 5 unsuccessful replications. At what point do you throw in the towel and say, “We’re never gonna get it”? I went on to manipulate the stimuli and task and ended up finding some cool results, but the taste of sour data was still in my mouth. The worst part was that I had to slap my “failures to replicate” on a poster and travel cross-country to present them at a conference. I was nervous before presenting, because how are you supposed to explain failures to replicate in psychophysics? It’s not like social psych, where one can point to the specter of “unknown moderators” (no offense, that’s my field now).

So, how did the conference go? Very well, I should think. I was not surprised by some reactions I got from viewers when I said those dreaded words, “failed to replicate,” on the order of: “Oh wow, that sucks for them,” “Welp, that’s never good,” “Oh no! He’s in my department… that’s embarrassing,” “Did you really try 5 times? I would have stopped after 1.” The most stress-inducing part of the whole thing was when the person I was failing to replicate came up and introduced himself. I was expecting hurt feelings, or animosity. What I got was a reasonable reply from a senior researcher in my field. He said, “Well, that’s really too bad. You never got it in 5 tries? Hmmm… I guess we might have overestimated how robust that effect is. It could be that it is just a weak effect. We’ve moved on since then to show the effect with other stimuli, but we haven’t done this exact setup again; maybe we should. Thanks for sharing with me, if you write up the manuscript I’d love it if you sent it to me when it’s done.”

What a reasonable guy. I was expecting bared teeth and a death stare, but what I got was a senior researcher in the field who was open to revising his beliefs.

One thing to note: his comment, “*if* you write up the manuscript I’d love it if you sent it to me when it’s done” (emphasis added), really highlights the view that replications are likely to be dropped if they “fail.” Hopefully this special issue can change the culture and change that “if” to “when.” Thanks to Daniel Lakens (@lakens) and Brian Nosek (@BrianNosek) for trailblazing.

Musings on correlations: doubling my sample size doesn’t help much

I’ve recently run an experiment in which I train kids on a computer task and see how they improve after a practice session. We want to see whether the kids improve more as they get older, so we calculate the correlation between the kids’ ages (in months) and their improvement scores.¹ If we tested 40 kids and found a correlation of .30, how much should we trust our result? I did some simulations to find out. This was inspired by a recent paper by Stanley and Spence (2014).
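The correlation in question is just a Pearson r between age and improvement. As a quick illustration (with made-up numbers, not the study’s actual data), here is how it might be computed in Python:

```python
import numpy as np

# Hypothetical ages (months) and improvement scores, for illustration only
ages = np.array([38, 42, 45, 47, 50, 52, 55, 58, 60, 63])
improvement = np.array([2.1, 1.8, 3.0, 2.5, 3.4, 2.9, 3.8, 3.5, 4.1, 3.9])

# Pearson correlation between age and improvement
r = np.corrcoef(ages, improvement)[0, 1]
print(round(r, 2))  # strongly positive for this made-up data
```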

A common way to represent the uncertainty present in an estimate is to calculate the confidence interval (usually 95%) associated with that estimate. Shorter intervals mean less uncertainty in the estimate. The calculations for 95% confidence intervals ensure that, in the very long run, 95% of your intervals will capture the true population value. The key here is that only through repeated sampling can you be confident that most of your intervals will be in range of that true population value. For example, if I find my correlation estimate is r = .30, 95% CI [-.01, .56], then presumably the true correlation could be anywhere in that range, or even outside of it if I’m unlucky (you really never know). That can say a lot to some stats people, but I like to see what it actually looks like through simulation.
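The post doesn’t say how its intervals were computed, but the quoted interval matches the standard Fisher z-transformation method, so here is a minimal Python sketch of that calculation (an assumption about the method, not the author’s own code):

```python
import math

def fisher_ci(r, n):
    """Approximate 95% CI for a Pearson correlation via Fisher's z-transform."""
    z = math.atanh(r)            # transform r to the (roughly normal) z-scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z-scale
    lo, hi = z - 1.96 * se, z + 1.96 * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r-scale

lo, hi = fisher_ci(0.30, 40)
print(round(lo, 2), round(hi, 2))  # -0.01 0.56, the interval quoted above
```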

Below are results for 500 samples of 40 participants each when the population correlation is .10, .30, or .50 (marked by the vertical line; that weird p on the axis is the Greek letter rho, ρ), and beside each is an example of what that correlation might look like. You can click the picture to see it larger. Each sample is drawn from the same population, meaning that the variation is due only to sampling error. Each green dot is a sample whose correlation estimate and 95% interval capture the true correlation; red dots are samples that fail to capture the true correlation. As you can see, with 40 participants in each sample there is a lot of variation. Imagine your experiment as picking one of these dots at random.
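The figures come from simulations whose code isn’t shown in the post; the procedure described above can be sketched as follows, assuming bivariate-normal data and Fisher-z intervals (both are assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(2014)

def simulate_coverage(rho, n=40, n_samples=500):
    """Draw n_samples samples of size n from a bivariate normal with true
    correlation rho, and count how often each sample's 95% CI captures rho
    (the "green" dots in the figures)."""
    cov = [[1.0, rho], [rho, 1.0]]
    green = 0
    for _ in range(n_samples):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r = np.corrcoef(x, y)[0, 1]
        z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
        lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
        green += lo <= rho <= hi
    return green / n_samples

for rho in (0.10, 0.30, 0.50):
    print(rho, simulate_coverage(rho))  # each should hover around 0.95
```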

[Figure: sampling distributions of r, n = 40, for ρ = .10, .30, and .50]

Some quick observations: 1) most samples fall fairly close to the true value; 2) the range of estimates in each panel is huge. With 40 subjects in our sample, a correlation of .30 comes out green under each of these populations: its 95% CI covers .10, .30, and .50 alike. How can we know which one we are actually sampling from if we only have our single sample? One commonly proposed solution is to take larger samples. If each sample consisted of 80 participants instead of 40, we would expect the sample heaps to be narrower. But how does it change our interpretation of our example r = .30? With n = 80, the 95% CI around .30 ranges from .09 to .49.

[Figure: sampling distributions of r, n = 80, for ρ = .10, .30, and .50]

Now with n = 80 the interpretation of our result changes only slightly. When the true correlation is .10, our r = .30 is still just ever so slightly green; remember, our 95% CI ranged as low as .09. However, when the true correlation is .50, our r = .30 is now red; our 95% CI ranged only as high as .49. But remember, when you only have your single sample you don’t know what color it is: it could always be red! In the very long run, 5% of all samples will be red.

So what is the takeaway from all of this? When we doubled our sample size from n = 40 to n = 80, our 95% CI shrank from [-.01, .56] to [.09, .49]; at least we can tentatively rule out a sign error² once we double the sample. That really isn’t much. And when you look at the figures, the range of estimates gets smaller for each respective population, but not much changes in terms of our interpretation of that single r = .30. That really sucks. It’s hard enough for me to collect data and train 40 preschoolers, let alone 80. But even if I did double my efforts I wouldn’t get much out of it! That really, really sucks.
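To put a number on “doubling doesn’t help much”: assuming the same Fisher-z intervals as a standard method (my assumption, since the post doesn’t show its calculations), the interval’s width shrinks roughly with 1/√(n−3), so going from 40 to 80 participants only cuts the width by about 30%:

```python
import math

def ci_width(r, n):
    """Width of a Fisher-z 95% CI around an observed correlation r."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z + 1.96 * se) - math.tanh(z - 1.96 * se)

w40, w80 = ci_width(0.30, 40), ci_width(0.30, 80)
print(round(w40, 2), round(w80, 2))  # 0.57 vs 0.4
print(round(w80 / w40, 2))           # about 0.7: double the kids, ~30% narrower
```

On the z-scale the shrinkage factor is √(37/77) ≈ 0.69, which is why it takes roughly quadrupling, not doubling, the sample to halve the interval.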

There is no happy ending to this blog post. This seems bleak! Hopefully other people can replicate my study’s findings and we can pool our knowledge to end up with more informative estimates of these effects.

Thanks for reading, please comment!


¹For those who need a refresher on correlations: a correlation can be negative (an increase in age corresponds with a decrease in improvement score) or positive (an increase in age corresponds with an increase in improvement score). The range goes from -1, a perfect negative correlation, to +1, a perfect positive correlation. Those extremes never actually happen in real experiments.

²A sign error is claiming with confidence that the true correlation is positive when in fact it is negative, or vice versa.