The Special One-Way ANOVA (or, Shutting up Reviewer #2)

The One-Way Analysis of Variance (ANOVA) is a handy procedure that is commonly used when a researcher has three or more groups that they want to compare. If the test comes up significant, follow-up tests are run to determine which groups show meaningful differences. These follow-up tests are often corrected for multiple comparisons (the Bonferroni method is most common in my experience), dividing the nominal alpha (usually .05) by the number of tests. So if there are 5 follow-up tests, each comparison’s p-value must be below .01 to really “count” as significant. This reduces the power of each test considerably, but better guards against false-positives. It is common to correct all follow-up tests after a significant main effect, no matter the experimental design, but this is unnecessary when there are only three levels. H/T to Mike Aitken Deakin (here: @mrfaitkendeakin) and Chris Chambers (here: @chrisdc77) for sharing.
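To make the Bonferroni arithmetic concrete, here is a minimal sketch. The function names are mine, and the second function assumes the follow-up tests are independent, which real follow-up tests on shared groups usually are not, so treat its numbers as a rough guide only.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold after a Bonferroni correction."""
    return alpha / n_tests


def fwer_independent(per_test_alpha, n_tests):
    """Chance of at least one false positive across n_tests independent true nulls."""
    return 1 - (1 - per_test_alpha) ** n_tests


alpha, n_tests = 0.05, 5
corrected = bonferroni_alpha(alpha, n_tests)
print(f"Per-test threshold with {n_tests} tests: {corrected:.3f}")
print(f"FWER without correction: {fwer_independent(alpha, n_tests):.3f}")
print(f"FWER with correction:    {fwer_independent(corrected, n_tests):.3f}")
```

Under that independence assumption, five uncorrected tests push the chance of at least one false-positive well above the nominal 5%, and the corrected threshold pulls it back just under.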

The Logic of the Uncorrected Test

In the case of the One-Way ANOVA with three levels, it is not necessary to correct for the extra t-tests because the experimental design ensures that the family-wise error rate will necessarily stay at 5% — so long as no follow-up tests are carried out when the overall ANOVA is not significant.

The family-wise error rate (FWER) is the probability of making at least 1 erroneous rejection of the null-hypothesis in a set of tests, and our alpha level is the tolerance we allow for it. If we make 2, 3, or even 4 erroneous rejections, it isn’t considered any worse than 1. Whether or not this makes sense is for another blog post. But taking this definition, we can think through the scenarios (outlined in Chris’s tweet) and see why no corrections are needed:

True relationship: µ1 = µ2 = µ3 (null-hypothesis is really true, all groups equal). If the main effect is not significant, no follow-up tests are run and the FWER remains at 5%. (If you run follow-up tests at this point you do need to correct for multiple comparisons.) If the main effect is significant, it does not matter what the follow-up tests show because we have already committed our allotted false-positive. In other words, we’ve already made the higher order mistake of saying that some differences are present before we even examine the individual group contrasts. Again, the FWER accounts for making at least 1 erroneous rejection. So no matter what our follow-up tests show, the FWER remains at 5% since we have already made our first false-positive before even conducting the follow-ups.

True relationship: µ1 ≠ µ2 = µ3, OR µ1 = µ2 ≠ µ3, OR µ1 = µ3 ≠ µ2 (null-hypothesis is really false, one group stands out). If the main effect is significant then we are correct, and no false-positive is possible at this level. We then run our follow-up tests. Because one group really is different from the other two, only one pair of means is truly equal, so that single pair is the only place for a possible false-positive result. Again, our FWER stays at or below 5% because we only have 1 opportunity to erroneously reject a null-hypothesis.

True relationship: µ1 ≠ µ2 ≠ µ3, with µ1 ≠ µ3 as well. A false-positive is impossible in this case because all three groups are truly different. All follow-up tests necessarily keep the FWER at 0%!

There is no possible scenario where your FWER goes above 5%, so no need to correct for multiple comparisons! 
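The three scenarios above can be checked with a quick simulation. This is only a sketch under my own assumptions: 20 subjects per group, normal data with SD 1, follow-up t-tests that use the pooled ANOVA error term, and a hard-coded approximate critical t for 57 degrees of freedom. All names in it are made up for illustration.

```python
import random
import statistics

N_PER_GROUP = 20                 # assumed group size for this illustration
DF_WITHIN = 3 * N_PER_GROUP - 3  # 57 error degrees of freedom
T_CRIT = 2.0025                  # approximate two-sided .05 critical t for df = 57 (hard-coded)
ALPHA = 0.05


def anova_three_groups(groups):
    """Return (p_value, mse, group_means) for a one-way ANOVA on three equal-sized groups."""
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)  # equal n, so the grand mean is the mean of group means
    ss_between = sum(N_PER_GROUP * (m - grand) ** 2 for m in means)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mse = ss_within / DF_WITHIN
    f_stat = (ss_between / 2) / mse
    # With 2 numerator df, the F survival function has this closed form.
    p_value = (1 + 2 * f_stat / DF_WITHIN) ** (-DF_WITHIN / 2)
    return p_value, mse, means


def simulate_fwer(true_means, reps=2000, seed=1):
    """Estimate the FWER with and without requiring a significant ANOVA first."""
    rng = random.Random(seed)
    pairs = [(0, 1), (0, 2), (1, 2)]
    # Pairs whose population means are truly equal: rejecting one is a false positive.
    null_pairs = [(i, j) for i, j in pairs if true_means[i] == true_means[j]]
    all_equal = len(null_pairs) == 3
    gated_errors = ungated_errors = 0
    for _ in range(reps):
        groups = [[rng.gauss(mu, 1.0) for _ in range(N_PER_GROUP)] for mu in true_means]
        p_value, mse, means = anova_three_groups(groups)
        se_diff = (2 * mse / N_PER_GROUP) ** 0.5
        pair_error = any(abs(means[i] - means[j]) / se_diff > T_CRIT for i, j in null_pairs)
        # Gated: follow-ups only count after a significant ANOVA; when all means
        # are equal, the significant ANOVA is itself the false positive.
        if p_value < ALPHA and (all_equal or pair_error):
            gated_errors += 1
        # Ungated: uncorrected follow-ups are run no matter what the ANOVA says.
        if pair_error:
            ungated_errors += 1
    return gated_errors / reps, ungated_errors / reps


for label, mus in [("all equal", (0, 0, 0)), ("one differs", (1, 0, 0)), ("all differ", (0, 1, 2))]:
    gated, ungated = simulate_fwer(mus)
    print(f"{label:12s} gated FWER = {gated:.3f}   ungated FWER = {ungated:.3f}")
```

The "ungated" column is there to show the caveat from the first scenario: if you run the uncorrected follow-ups without a significant ANOVA, the protection is gone.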

So the next time Reviewer #2 gives you a hard time about correcting for multiple comparisons on a One-Way ANOVA with three levels, you can rightfully defend your uncorrected t-tests. Not correcting the alpha saves you some power, thereby making it easier to support your interesting findings.

If you wanted to sidestep the multiple comparison problem altogether you could do a fully Bayesian analysis, in which the number of tests conducted holds no weight on the evidence of a single test. So in other words, you could jump straight to the comparisons of interest instead of doing the significant main effect → follow-up test routine. Wouldn’t that save us all a lot of hassle?


The Broken Ratchet

In a recent paper, Tennie and colleagues provide new data with regard to the concept of cumulative cultural learning. They set out to find evidence for a cultural “ratchet”, a mechanism by which one secures advantageous behavior seen in others, while simultaneously improving the behavior to become more efficient/productive. This is most commonly studied with diffusion chains, as is done here. The authors rounded up 80 four-year-olds (40 male, 40 female) and sorted them into chains of 5 kids each, leaving them with eight male and eight female chains. What follows is what I took away from this paper.

The kids’ task was simple: Try to fill a bucket with as much dry rice as possible. Two kids would be in the room at a time. Kids who completed their turn would swap out for kids who were new to the task, so that there was always 1 kid filling the bucket and 1 kid watching. The kids were given different tools they could potentially use (see their figure 1 below). Some tools were obviously better than others, with carrying capacities of: Bowl – 817.5g, Bucket – 439.7g, Scoop – 63.9g, Cardboard – 21.5g. In half of the chains, the first child saw an experimenter use the worst tool of the bunch (flimsy cardboard, circled in the figure) and the other half didn’t get a demonstration at all.

As the authors said, “A main question of interest was whether children copied [Experimenter]’s and/or the previous child’s choice of tool or whether they innovated by introducing new tools”. In other words, evidence for a ratchet effect would manifest in later generations using more productive tools than the earlier generations. Another interest is whether this innovation differed between conditions: chains that had an experimenter demonstration versus those that did not. I’m not sure why this manipulation is interesting, seeing as the only kids who see the experimenter perform the task are in Generation 1.

Without even going into the stats, I don’t see much evidence that kids are ratcheting. Most chains in the baseline show the following pattern: Generation 1 uses tool X and all subsequent generations use tool X. Two chains manage to break the imitation spell, both switching from scoop+bucket to scoop+bowl. The experimental group shows a similar pattern, where the kids either all copy generation 1 (who copied the experimenter) or one adventurous kid in the chain decides to switch tools and the rest copy him/her. Interestingly, the chains in the experimental group only ever switched from the cardboard to the scoop, effectively going from the worst tool to the second-worst tool. If these kids were trying to score the most rice, wouldn’t it be best to switch to the bucket or the bowl? Weird.

The authors propose that kids in baseline didn’t innovate across generations because they were already performing at a high level in generation 1, so they didn’t have room to grow. Well, the only chains who did actually innovate in baseline started with scoop + bucket (second highest capacity tool) and went to scoop + bowl (highest capacity tool). Further, the chains in the lowest starting position, scoop only, never innovated.

Overall I thought the experiment was cool. Rounding up 80 four-year-olds is not to be scoffed at. But I don’t agree with their claim that the baseline group was at ceiling and I don’t see much ratcheting in the experimental group (who all start with the worst tool).

Lack of Power (and not the statistical kind)

One thing that never really comes up when people talk about “Questionable Research Practices” is what to do when you’re a junior in the field and someone your senior suggests that you partake. [snip] It can be daunting to be the only one who thinks we shouldn’t drop 2 outliers to get our p-value from .08 to .01, or who thinks we shouldn’t go collect 5 more subjects to make it “work.” When it is 1 vs 4 and you’re at the bottom of the totem pole, it rarely works out the way you want. It is hard not to get defensive, and you desperately want everyone to just come around to your thinking, but it doesn’t happen. What can the little guy say to the behemoths staring him down?

I’ve recently been put in this situation, and I am finding it to be a challenge that I don’t know how to overcome. It is difficult to explain to someone that what they are suggesting you do is [questionable] (at least not without sounding accusatory). I can explain the problems with letting our post hoc p-value guide interpretation, or the problems for replicability when the analysis plan isn’t predetermined, or the problems with cherry-picking outliers, but it’s really an ethical issue at its core. I don’t want to engage in what I know is a [questionable] practice, but I don’t have a choice. I can’t afford to burn bridges when those same bridges are the only things that get me over the water and into a job.

I’ve realized that this amazing movement in the field of psychology has left me feeling somewhat helpless. When push comes to shove, the one running the lab wins and I have to yield, even against my better judgment. After five months of data collection, am I supposed to just step away and not put my name on the work? There’s something to that, I suppose. A bit of poetic justice. But justice doesn’t get you into grad school, or get you a PhD, or get you a faculty job, or get you a grant, or get you tenure. The pressure is real for the ones at the bottom. I think more attention needs to be paid to this aspect of the psychology movement. I can’t be the only one who feels like I know what I should (and shouldn’t) be doing but don’t have a choice.

Edit: See another great point of view on this issue here

edit3: Changed some language

An undergraduate’s experience with replications

A lot of psychologists are in a bit of a tiff right now. I think everyone agrees that replications are important, but it doesn’t seem like there is a consensus on how they should be carried out (for many perspectives, see: here, here, here, here, here, here, here). Since Sanjay asked for more perspectives from people who aren’t tenured, I figured I’d write up my experience with replication. Take note, I graduated but I am not in graduate school yet, so I am one vulnerable puppy. Luckily my experience was very civil.

During my junior/senior year fellowship, I ran 2 identical direct replications of a psychophysics experiment and both were disappointing. I wasn’t the first person in the lab to try to replicate it either: the addition of my “failures” made it 5 collective unsuccessful replications. At what point do you throw in the towel and say, “We’re never gonna get it”? I went on to manipulate the stimuli and task and ended up finding some cool results, but the taste of sour data was still in my mouth. The worst part was that I had to slap my “failures to replicate” on a poster and travel cross-country to present them at a conference. I was nervous before presenting, because how are you supposed to explain failures to replicate in psychophysics? It’s not like social psych, where one can point to the specter of “unknown moderators” (no offense, that’s my field now).

So, how did the conference go? Very well I should think. I was not surprised by some reactions I got from viewers when I said those dreaded words, “failed to replicate,” on the order of: “Oh wow, that sucks for them,” “Welp, that’s never good,” “Oh no! He’s in my department…..that’s embarrassing,” “Did you really try 5 times? I would have stopped after 1.” The most stress-inducing part of the whole thing was when the person I was failing to replicate came up and introduced himself. I was expecting hurt feelings, or animosity. What I got was a reasonable reply from a senior in my field. He said, “Well, that’s really too bad. You never got it in 5 tries? Hmmm…. I guess we might have overestimated how robust that effect is. It could be that it is just a weak effect. We’ve moved on since then to show the effect with other stimuli but we haven’t done this exact setup again, maybe we should. Thanks for sharing with me, if you write up the manuscript I’d love it if you sent it to me when it’s done.”

What a reasonable guy. I was expecting bared teeth and a death stare, but what I got was a senior in the field who was open to revising his beliefs.

One thing to note: his comment, “if you write up the manuscript I’d love it if you sent it to me when it’s done” (my emphasis is on the first “if”), really highlights the view that replications are likely to be dropped if they “fail.” Hopefully this special issue can change the culture and change that “if” to “when.” Thanks to Daniel Lakens (@lakens) and Brian Nosek (@BrianNosek) for trailblazing.

Musings on correlations: doubling my sample size doesn’t help much

I’ve recently run an experiment where I train kids on a computer task and see how they improve after a practice session. We want to see if the kids improve more as they get older, and so we calculate the correlation between the kids’ ages (in months) and their improvement scores.¹ If we tested 40 kids and found a correlation of .30, how much should we trust our result? I did some simulations to find out. This was inspired by a recent paper by Stanley and Spence (2014).

A common way to represent the uncertainty present in an estimate is to calculate the confidence interval (usually 95%) associated with that estimate. Shorter intervals mean less uncertainty in the estimate. The calculations for 95% confidence intervals ensure that, in the very long run, 95% of your intervals will capture the true population value. The key here is that only through repeated sampling can you be confident that most of your intervals will be in range of that true population value. For example, if I find my correlation estimate is r=.30, 95% CI [-.01, .56], then presumably the true correlation could be anywhere in that range, or even outside of it if I’m unlucky (you really never know). That can say a lot to some stats people, but I like to see what it actually looks like through simulation.
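For anyone who wants to see where an interval like that comes from, here is a small sketch using the Fisher z-transformation, a standard approximation for correlation intervals. The function name is just for illustration, and this method is an assumption on my part about how to get the interval, though it does reproduce the numbers in the example.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = correlation_ci(0.30, 40)
print(f"r = .30, n = 40: 95% CI [{low:.2f}, {high:.2f}]")
```

Note that the interval is not symmetric around .30, because the back-transformation squeezes values as they approach 1.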

Below are results for 500 samples of 40 participants each when the population correlation is .10, .30, or .50 (signified by the vertical line; that weird p on the axis is called rho) and beside each is an example of what that correlation might look like. Each sample is pulling from the same population, meaning that the variation is only due to sampling error. Each green dot is a sample whose correlation estimate and 95% interval capture the true correlation; red dots are samples that fail to capture the true correlation. As you can see, with 40 participants in each sample there is a lot of variation. Imagine your experiment as picking one of these dots at random.
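A simulation along these lines can be sketched in a few lines of Python. This is not the code behind the figures; it assumes bivariate normal data, Fisher z intervals, and an arbitrary random seed, and the function names are invented for the example.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z-transformation."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def simulate(rho, n=40, n_samples=500, seed=2014):
    """Draw n_samples samples of size n from a population with correlation rho."""
    rng = random.Random(seed)
    estimates, hits = [], 0
    for _ in range(n_samples):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a fraction rho of x, plus independent noise, so corr(x, y) = rho
        ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
        r = pearson_r(xs, ys)
        low, high = fisher_ci(r, n)
        hits += low <= rho <= high   # a "green dot" if the interval captures rho
        estimates.append(r)
    return estimates, hits / n_samples


for rho in (0.10, 0.30, 0.50):
    estimates, coverage = simulate(rho)
    print(f"rho = {rho:.2f}: r ranged from {min(estimates):.2f} to {max(estimates):.2f}, "
          f"{coverage:.1%} of intervals captured rho")
```

Changing `n=40` to `n=80` in the call to `simulate` gives the larger-sample version.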

Some quick observations: 1) most samples fall fairly close to the true value, 2) the range on all of these samples is huge. So with 40 subjects in our sample, a correlation of .30 is green in each of these different populations. How can we know which one we are actually sampling from if we only have our single sample? One commonly proposed solution is to take larger samples. If each sample consisted of 80 participants instead of 40, we would expect the sample heaps to be narrower. But how does it change our interpretation of our example r=.30? With n=80, the 95% CI around .30 ranges from .09 to .49.

Now with n=80 the interpretation of our result only changes slightly. When the true correlation is .10 our r=.30 is still just ever so slightly green; remember, our 95% CI ranged as low as .09. However, when the true correlation is .50 our r=.30 is now red; our 95% CI ranged only as high as .49. But remember, when you only have your single sample you don’t know what color it is. Always remember that it could be red! In the very long run 5% of all samples will be red.

So what is the takeaway from all of this? When we doubled our sample size from n=40 to n=80, our 95% CI shrank from [-.01, .56] to [.09, .49]. At least we can tentatively rule out a sign error² when we double the sample. That really isn’t much. And when you look at the figures, the range of estimates gets smaller for each respective population, but not much changes in terms of our interpretation of that single r=.30. That really sucks. It’s hard enough for me to collect data and train 40 preschoolers, let alone 80. But even if I did double my efforts I wouldn’t get much out of it! That really really sucks.
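Part of why doubling buys so little is that, under the Fisher z approximation (my assumption here, not necessarily how the intervals above were computed), the interval's half-width on the z scale is 1.96 / sqrt(n - 3). It shrinks with the square root of the sample size, not with the sample size itself. A rough sketch, with sample sizes beyond 80 added only for illustration:

```python
import math


def fisher_half_width(n, z_crit=1.96):
    """Half-width of the 95% interval on the Fisher z scale."""
    return z_crit / math.sqrt(n - 3)


r = 0.30
for n in (40, 80, 160, 320):  # sizes beyond 80 are hypothetical
    hw = fisher_half_width(n)
    low, high = math.tanh(math.atanh(r) - hw), math.tanh(math.atanh(r) + hw)
    print(f"n = {n:3d}: z-scale half-width = {hw:.3f}, 95% CI [{low:.2f}, {high:.2f}]")

ratio = fisher_half_width(80) / fisher_half_width(40)
print(f"Going from n = 40 to n = 80 multiplies the half-width by {ratio:.2f}")
```

So each doubling trims the interval by roughly 30%, which is why the extra 40 preschoolers feel like such a poor return.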

There is no happy ending to this blog post. This seems bleak! Hopefully other people can replicate my study’s findings and we can pool our knowledge to end up with more informative estimates of these effects.

Thanks for reading, please comment!


¹For those who need a refresher on correlations, a correlation can be negative (an increase in age corresponds with a decrease in improvement score) or positive (an increase in age corresponds with an increase in improvement score). The range goes from -1, meaning a perfect negative correlation, to +1, meaning a perfect positive correlation. Those never actually happen in real experiments.

²A sign error is claiming with confidence that the true correlation is positive when in fact it is negative, or vice versa.