The Special One-Way ANOVA (or, Shutting up Reviewer #2)

The One-Way Analysis of Variance (ANOVA) is a handy procedure that is commonly used when a researcher has three or more groups that they want to compare. If the test comes up significant, follow-up tests are run to determine which groups show meaningful differences. These follow-up tests are often corrected for multiple comparisons (the Bonferroni method is most common in my experience), dividing the nominal alpha (usually .05) by the number of tests. So if there are 5 follow up tests, each comparison’s p-value must be below .01 to really “count” as significant. This reduces the test’s power considerably, but better guards against false-positives. It is common to correct all follow-up tests after a significant main effect, no matter the experimental design, but this is unnecessary when there are only three levels. H/T to Mike Aitken Deakin (here: @mrfaitkendeakin) and  Chris Chambers (here: @chrisdc77) for sharing.

The Logic of the Uncorrected Test

In the case of the One-Way ANOVA with three levels, it is not necessary to correct for the extra t-tests because the experimental design ensures that the family-wise error rate will necessarily stay at 5% — so long as no follow-up tests are carried out when the overall ANOVA is not significant.

A family-wise error rate (FWER) is the allowed tolerance for making at least 1 erroneous rejection of the null-hypothesis in a set of tests. If we make 2, 3, or even 4 erroneous rejections, it isn’t considered any worse than 1. Whether or not this makes sense is for another blog post. But taking this definition, we can think through the scenarios (outlined in Chris’s tweet) and see why no corrections are needed:

True relationship: µ1 = µ2 = µ3 (null-hypothesis is really true, all groups equal). If the main effect is not significant, no follow-up tests are run and the FWER remains at 5%. (If you run follow-up tests at this point you do need to correct for multiple comparisons.) If the main effect is significant, it does not matter what the follow-up tests show because we have already committed our allotted false-positive. In other words, we’ve already made the higher order mistake of saying that some differences are present before we even examine the individual group contrasts. Again, the FWER accounts for making at least 1 erroneous rejection. So no matter what our follow-up tests show, the FWER remains at 5% since we have already made our first false-positive before even conducting the follow-ups.

True relationship: µ1 ≠ µ2 = µ3, OR µ1 = µ2 ≠ µ3, OR µ1 ≠ µ3 = µ2  (null-hypothesis is really false, one group stands out). If the main effect is significant then we are correct, and no false-positive is possible at this level. We go with our follow-up tests (where it is really true that one group is different from the other two), where only one pair of means is truly equal. So that single pair is the only place for a possible false-positive result. Again, our FWER remains at 5% because we only have 1 opportunity to erroneously reject a null-hypothesis.

True relationship: µ1 ≠ µ2 ≠ µ3. A false-positive is impossible in this case because all three groups are truly different. All follow-up tests necessarily keep the FWER at 0%!

There is no possible scenario where your FWER goes above 5%, so no need to correct for multiple comparisons! 

So the next time Reviewer #2 gives you a hard time about correcting for multiple comparisons on a One-Way ANOVA with three levels, you can rightfully defend your uncorrected t-tests. Not correcting the alpha saves you some power, thereby making it easier to support your interesting findings.

If you wanted to sidestep the multiple comparison problem altogether you could do a fully Bayesian analysis, in which the number of tests conducted holds no weight on the evidence of a single test. So in other words, you could jump straight to the comparisons of interest instead of doing the significant main effect → follow-up test routine. Wouldn’t that save us all a lot of hassle?


Lack of Power (and not the statistical kind)

One thing that never really comes up when people talk about “Questionable Research Practices,” is what to do when you’re a junior in the field and someone your senior suggests that you partake. [snip] It can be daunting to be the only one on who thinks we shouldn’t drop 2 outliers to get our p-value from .08 to .01, or who thinks we shouldn’t go collect 5 more subjects to make it “work.” When it is 1 vs 4 and you’re at the bottom of the totem pole, it rarely works out the way you want. It is hard not to get defensive, and you desperately want everyone to just come around to your thinking- but it doesn’t happen. What can the little guy say to the behemoths staring him down?

I’ve recently been put in this situation, and I am finding it to be a challenge that I don’t know how to overcome. It is difficult to explain to someone that what they are suggesting you do is [questionable] (At least not without sounding accusatory). I can explain the problems with letting our post hoc p-value guide interpretation, or the problems for replicability when the analysis plan isn’t predetermined, or the problems with cherry picking outliers, but it’s really an ethical issue at its core. I don’t want to engage in what I know is a [questionable] practice, but I don’t have a choice. I can’t afford to burn bridges when those same bridges are the only things that get me over the water and into a job.

I’ve realized that this amazing movement in the field of psychology has left me feeling somewhat helpless. When push comes to shove, the one running the lab wins and I have to yield- even against my better judgment. After six five months of data collection, am I supposed to just step away and not put my name on the work? There’s something to that, I suppose. A bit of poetic justice. But justice doesn’t get you into grad school, or get you a PhD, or get you a faculty job, or get you a grant, or get you tenure. The pressure is real for the ones at the bottom. I think more attention needs to be paid to this aspect of the psychology movement. I can’t be the only one who feels like I know what I should (and shouldn’t) be doing but don’t have a choice.

Edit: See another great point of view on this issue here

edit3: Changed some language

An undergraduate’s experience with replications

A lot of psychologists are in a bit of a tiff right now. I think everyone agrees that replications are important, but it doesn’t seem like there is a consensus for how it should go about (For many perspectives, see: here, here, here, here, here, here, here). Since Sanjay asked for more perspectives from people who aren’t tenured, I figured I’d write up my experience with replication. Take note, I graduated but I am not in graduate school yet, so I am one vulnerable puppy. Luckily my experience was very civil.

During my junior/senior year fellowship, I ran 2 identical direct replications of a psychophysics experiment and both were disappointing. I wasn’t the first person in the lab to try to replicate it either: the addition of my “failures” made it 5 collective unsuccessful replications. At what point do you throw in the towel and say, “We’re never gonna get it”? I went on to manipulate the stimuli and task and ended up finding some cool results, but the taste of sour data was still in my mouth. The worst part was that I had to slap my “failures to replicate” on a poster and travel cross-country to present them at a conference. I was nervous before presenting, because how are you supposed to explain failures to replicate in psychophysics? It’s not like social psych, where one can point to the specter of “unknown moderators” (no offense, that’s my field now).

So, how did the conference go? Very well I should think. I was not surprised by some reactions I got from viewers when I said those dreaded words, “failed to replicate,” on the order of: “Oh wow, that sucks for them,” “Welp, that’s never good,” “Oh no! He’s in my department…..that’s embarrassing,” “Did you really try 5 times? I would have stopped after 1.” The most stress-inducing part of the whole thing was when the person I was failing to replicate came up and introduced himself. I was expecting hurt feelings, or animosity. What I got was a reasonable reply from a senior in my field. He said, “Well, that’s really too bad. You never got it in 5 tries? Hmmm…. I guess we might have overestimated how robust that effect is. It could be that it is just a weak effect. We’ve moved on since then to show the effect with other stimuli but we haven’t done this exact setup again, maybe we should. Thanks for sharing with me, if you write up the manuscript I’d love it if you sent it to me when it’s done.”

What a reasonable guy. I was expecting barred teeth and a death stare, but what I got was a senior in the field who was open to revising his beliefs.

One thing to note: his comment, “if you write up the manuscript I’d love it if you sent it to me when it’s done (emphasis added)” really highlights the view that replications are likely to be dropped if they “fail.” Hopefully this special issue can change the culture and change that if to when. Thanks to Daniel Lakens (@lakens) and Brian Nosek (@BrianNosek) for trailblazing.