I’ve recently run an experiment where I train kids on a computer task and see how they improve after a practice session. We want to see if the kids improve more as they get older, and so we calculate the correlation between the kids’ ages (in months) and their improvement scores.¹ If we tested 40 kids and found a correlation of .30, how much should we trust our result? I did some simulations to find out. This was inspired by a recent paper by Stanley and Spence (2014).

A common way to represent the uncertainty present in an estimate is to calculate the confidence interval (usually 95%) associated with that estimate. Shorter intervals mean less uncertainty in the estimate. The calculations for 95% confidence intervals ensure that, in the very long run, 95% of your intervals will capture the true population value. The key here is that only through repeated sampling can you be confident that most of your intervals will be in range of that true population value. For example, if I find my correlation estimate is r=.30 95%CI [-.01, .56] then presumably the true correlation could be anywhere in that range, or even outside of it if I’m unlucky (you really never know). That can say a lot to some stats people, but I like to see what it actually looks like through simulation.

Below are results for 500 samples of 40 participants each when the population correlation is .10, .30, or .50 (signified by the vertical line- that weird p on the axis is called rho) and beside each is an example of what that correlation might look like. You can click the picture to see it larger. Each sample is pulling from the same population, meaning that the variation is only due to sampling error. Each green dot is a sample whose correlation estimate and 95% interval capture the true correlation, red dots are samples that fail to capture the true correlation. As you can see, with 40 participants in each sample there is a lot of variation. Imagine your experiment as picking one of these dots at random.

Some quick observations: 1) most samples fall fairly close to the true value, 2) the range on all of these samples is huge. So with 40 subjects in our sample, a correlation of .30 is green in each of these different populations. How can we know which one we are actually sampling from if we only have our single sample? One commonly proposed solution is to take larger samples. If each sample consisted of 80 participants instead of 40, we would expect the sample heaps to be narrower. But how does it change our interpretation of our example r=.30? With n=80, the 95% CI around .30 ranges from .09 to .49.

Now with n=80 the interpretation of our result only changes slightly. When the true correlation is .10 our r=.30 is still just ever so slightly green; remember, our 95% CI ranged as low as .09. However, when the true correlation is .50 our r=.30 is now red; our 95% CI ranged only as high as .49. But remember, when you only have your single sample you don’t know what color it is- Always remember that it could be red! In the very long run 5% of all samples will be red.

So what is the takeaway from all of this? When we doubled our sample size from n=40 to n=80, our 95% CI shrunk from [-.01, .56] to [.09, .49]- at least we can tentatively rule out a sign error² when we double the sample. That really isn’t much. And when you look at the figures, the range of estimates gets smaller for each respective population- but not much changes in terms of our interpretation of that single r=.30. That really sucks. It’s hard enough for me to collect data and train 40 preschoolers, let alone 80. But even if I did double my efforts I wouldn’t get much out of it! That really really sucks.

There is no happy ending to this blog post. This seems bleak! Hopefully other people can replicate my study’s findings and we can pool our knowledge to end up with more informative estimates of these effects.

Thanks for reading, please comment!

¹For those who need a refresher on correlations, a correlation can be negative (an increase in age corresponds with a decrease in improvement score) or positive (increase in age -> increase improvement score). The range goes from -1, meaning a perfect negative correlation, to +1, meaning a perfect positive correlation. Those never actually happen in real experiments.

²A sign error is claiming with confidence that the true correlation is positive when in fact it is negative, or vice versa.