Can confidence intervals save psychology? Part 2

This is part 2 in a series about confidence intervals (here’s part 1). My goal is not really to answer the question in the title, but simply to discuss confidence intervals and their pros and cons. The last post explained why frequency statistics (and confidence intervals) can’t assign probabilities to one-time events, but always refer to a collective of long-run events.

If confidence intervals don’t really tell us what we want to know, does that mean we should throw them in the dumpster along with our p-values? No, for a simple reason: In the long run we will make fewer errors with confidence intervals (CIs) than we will with p. Eventually we may want to drop CIs for more nuanced inference, but for the time being we would do much better with this simple switch.

If we calculate CIs for every (confirmatory) experiment we ever run, roughly 95% of our CIs will hit the mark (i.e., contain the true population mean). Can we ever know which ones? Tragically, no. But some would feel pretty good about the process being used if it only has a 5% lifetime error rate. One could achieve a lower error rate by stretching the intervals (to, say, 99%) but that would leave them too embarrassingly wide for most.
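A quick simulation makes the long-run claim concrete. This is only a sketch under simplifying assumptions that are not part of the argument above: normally distributed data, a known population standard deviation (so a z-based interval), and made-up values for the sample size and true mean. The function name `ci_coverage` is hypothetical.

```python
import random
from statistics import NormalDist

def ci_coverage(level=0.95, n_experiments=10_000, n=30, mu=0.0, sigma=1.0, seed=1):
    """Fraction of z-based CIs (sigma assumed known) that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z_crit * sigma / n ** 0.5
    hits = 0
    for _ in range(n_experiments):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # The interval is sample_mean +/- half_width; check whether it covers mu.
        if abs(sample_mean - mu) <= half_width:
            hits += 1
    return hits / n_experiments

coverage_95 = ci_coverage(level=0.95)
coverage_99 = ci_coverage(level=0.99)
print(f"95% intervals covering the true mean: {coverage_95:.3f}")
print(f"99% intervals covering the true mean: {coverage_99:.3f}")
```

Note that the simulation can only report coverage because it knows the true mean. In a real experiment we never learn which of our intervals missed.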

If we use p we will be wrong 5% of the time in the long run when we are testing a true null-hypothesis (i.e., no association between variables, or no difference between means, etc., and assuming the analysis is 100% pre-planned). But when we are testing a false null-hypothesis we will be wrong roughly 40-50% of the time or more in the long run, because typical studies have low statistical power (Button et al., 2013; Cohen, 1962; Sedlmeier & Gigerenzer, 1989). If you are one of the many who do not believe a null-hypothesis can actually be true, then we are always in the latter scenario with that huge error rate. In many cases (i.e., studying smallish and noisy effects, like most of psychology) we would literally be better off by flipping a coin and declaring our result “significant” whenever it lands heads.
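The two error rates can be illustrated with a small simulation. It is a sketch, not a reanalysis of the cited power surveys: it assumes a two-sided z-test with known standard deviation, and the true effect is deliberately chosen (1.96 standard errors) so that power sits near 50%. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, n_experiments=10_000, seed=2):
    """Share of two-sided z-tests of H0: mean = 0 that reach p < alpha."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sigma / n ** 0.5
    rejections = 0
    for _ in range(n_experiments):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        p_value = 2 * (1 - std_normal.cdf(abs(sample_mean) / se))
        if p_value < alpha:
            rejections += 1
    return rejections / n_experiments

standard_error = 1.0 / 25 ** 0.5

# Null is true: every rejection is an error.
false_positive_rate = rejection_rate(true_mean=0.0)

# Null is false, effect sized for roughly 50% power: every non-rejection is an error.
miss_rate = 1 - rejection_rate(true_mean=1.96 * standard_error)

print(f"wrong when the null is true:  {false_positive_rate:.3f}")
print(f"wrong when the null is false: {miss_rate:.3f}")
```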

There is a limitation to this benefit of CIs, and this limitation is self-imposed. We cannot escape the monstrous error rates associated with p if we report CIs but then interpret them as if they are significance tests (i.e., reject if the null value falls outside the interval). Switching to confidence intervals will do nothing if we use them as a proxy for p. So the question then becomes: Do people actually interpret CIs simply as a null-hypothesis significance test? Yes, unfortunately they do (Coulson et al., 2010).
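Why a CI used as a proxy for p buys nothing can be seen directly: checking whether the null value lies outside a 95% interval gives exactly the same decision as checking p < .05. The sketch below assumes a z-test with known standard deviation and an invented sample size of 25; `z_test_and_ci` and `decisions_agree` are hypothetical helper names.

```python
from statistics import NormalDist

def z_test_and_ci(sample_mean, n, sigma=1.0, null_value=0.0, level=0.95):
    """Return the z-based CI and the two-sided p-value for one sample mean."""
    std_normal = NormalDist()
    se = sigma / n ** 0.5
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)
    p_value = 2 * (1 - std_normal.cdf(abs(sample_mean - null_value) / se))
    return ci, p_value

def decisions_agree(sample_mean, n=25, null_value=0.0):
    """True when 'null outside the 95% CI' and 'p < .05' give the same verdict."""
    (lower, upper), p_value = z_test_and_ci(sample_mean, n, null_value=null_value)
    null_outside_ci = not (lower <= null_value <= upper)
    return null_outside_ci == (p_value < 0.05)

for observed_mean in (0.10, 0.30, 0.39, 0.40, 0.60):
    (lower, upper), p_value = z_test_and_ci(observed_mean, n=25)
    print(f"mean={observed_mean:.2f}  CI=[{lower:.3f}, {upper:.3f}]  p={p_value:.4f}")
```

The interval carries more information than the p-value (its width and location), but that information is thrown away the moment it is reduced to "does it contain zero?"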

References

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but don’t guarantee, better inference than statistical significance testing. Frontiers in Psychology, 1, 26.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309.

Can confidence intervals save psychology? Part 1

Maybe, but probably not by themselves. This post was inspired by Christian Jarrett’s recent post (you should go read it if you missed it), and the resulting Twitter discussion. This will likely develop into a series of posts on confidence intervals.

Geoff Cumming is a big proponent of replacing all hypothesis testing with CI reporting. He says we should change the goal to be precise estimation of effects using confidence intervals, with a goal of facilitating future meta-analyses. But do we understand confidence intervals? (More estimation is something I can get behind, but I think there is still room for hypothesis testing.)

In the Twitter discussion, Ryne commented, “If 95% of my CIs contain Mu, then there is .95 prob this one does [emphasis mine]. How is that wrong?” It’s wrong for the same reason Bayesian advocates dislike frequency statistics: you cannot assign probabilities to single events or parameters in that framework. The .95 probability is a property of the process of creating CIs in the long run; it is not associated with any given interval. That means you cannot make any probabilistic claims about this interval containing Mu, or otherwise, this particular hypothesis being true.

In the frequency statistics framework, all probabilities are long-run frequencies (i.e., a proportion of times an outcome occurs out of all possible related outcomes). As such, all statements about associated probabilities must be of that nature. If a fair coin has an associated probability of 50% heads, and I flip a fair coin very many times, then in the long run I will obtain about half heads and half tails. In any given next flip there is no associated probability of heads. This flip is either heads (p(H) = 1) or tails (p(H) = 0) and we don’t know which until after we flip.¹ By assigning probabilities to single events the sense of a long-run frequency is lost (i.e., one flip is not a collective of all flips). As von Mises puts it:

Our probability theory [frequency statistics] has nothing to do with questions such as: “Is there a probability of Germany being at some time in the future involved in a war with Liberia?” (von Mises, 1957, p. 9, quoted in Oakes, 1986, p. 16)

This is why Ryne’s statement was wrong, and this is why there can be no statements of the kind, “X is the probability that these results are due to chance,”² or “There is a 50% chance that the next flip will be heads,” or “This hypothesis is probably false,” when one adopts the frequency statistics framework. All probabilities are long-run frequencies in a relevant “collective.” (Have I beaten this horse to death yet?) It’s counter-intuitive and strange that we cannot speak of any single event or parameter’s probability. But sadly we can’t in this framework, and as such, “There is .95 probability that Mu is captured by this CI,” is a vacuous statement. If you want to assign probabilities to single events and parameters come join us over in Bayesianland (we have cookies).

EDIT 11/17: See Ryne’s post for why he rejects the technical definition for a pragmatic definition.

Notes:

¹But don’t tell Daryl Bem that.

²Often a confused interpretation of the p-value. The correct interpretation is subtly different: “The probability of the obtained (or more extreme) results given chance.” “Given” is the key difference, because here you are assuming chance. How can an analysis assuming chance is true (i.e., p(chance) = 1) lead to a probability statement about chance being false?

References:

Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 0956797613504966.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. New York: Wiley.

The Special One-Way ANOVA (or, Shutting up Reviewer #2)

The One-Way Analysis of Variance (ANOVA) is a handy procedure that is commonly used when a researcher has three or more groups that they want to compare. If the test comes up significant, follow-up tests are run to determine which groups show meaningful differences. These follow-up tests are often corrected for multiple comparisons (the Bonferroni method is most common in my experience) by dividing the nominal alpha (usually .05) by the number of tests. So if there are 5 follow-up tests, each comparison’s p-value must be below .01 to really “count” as significant. This reduces the test’s power considerably, but better guards against false-positives. It is common to correct all follow-up tests after a significant main effect, no matter the experimental design, but this is unnecessary when there are only three levels. H/T to Mike Aitken Deakin (here: @mrfaitkendeakin) and Chris Chambers (here: @chrisdc77) for sharing.
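The Bonferroni arithmetic is simple enough to write down. The sketch below also shows why people correct in the first place: with 5 uncorrected tests at .05, the chance of at least one false positive would be about 23% if all five nulls were true and the tests were independent (that independence is an assumption for illustration; follow-up tests on shared groups are not independent). Function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold after a Bonferroni correction."""
    return alpha / n_tests

def uncorrected_fwer(alpha, n_tests):
    """Chance of at least one false positive across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

per_test_alpha = bonferroni_alpha(0.05, 5)
fwer_if_uncorrected = uncorrected_fwer(0.05, 5)
print(f"per-test alpha with 5 follow-ups: {per_test_alpha:.3f}")
print(f"family-wise error rate if left uncorrected: {fwer_if_uncorrected:.3f}")
```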

The Logic of the Uncorrected Test

In the case of the One-Way ANOVA with three levels, it is not necessary to correct for the extra t-tests because the experimental design ensures that the family-wise error rate will necessarily stay at 5% — so long as no follow-up tests are carried out when the overall ANOVA is not significant.

A family-wise error rate (FWER) is the allowed tolerance for making at least 1 erroneous rejection of the null-hypothesis in a set of tests. If we make 2, 3, or even 4 erroneous rejections, it isn’t considered any worse than 1. Whether or not this makes sense is for another blog post. But taking this definition, we can think through the scenarios (outlined in Chris’s tweet) and see why no corrections are needed:

True relationship: µ1 = µ2 = µ3 (null-hypothesis is really true, all groups equal). If the main effect is not significant, no follow-up tests are run and the FWER remains at 5%. (If you run follow-up tests at this point you do need to correct for multiple comparisons.) If the main effect is significant, it does not matter what the follow-up tests show because we have already committed our allotted false-positive. In other words, we’ve already made the higher order mistake of saying that some differences are present before we even examine the individual group contrasts. Again, the FWER accounts for making at least 1 erroneous rejection. So no matter what our follow-up tests show, the FWER remains at 5% since we have already made our first false-positive before even conducting the follow-ups.

True relationship: µ1 ≠ µ2 = µ3, OR µ1 = µ2 ≠ µ3, OR µ1 = µ3 ≠ µ2 (null-hypothesis is really false, one group stands out). If the main effect is significant then we are correct, and no false-positive is possible at this level. We move on to our follow-up tests, where one group really is different from the other two and only one pair of means is truly equal. That single pair is the only place for a possible false-positive result. Again, our FWER remains at 5% because we only have 1 opportunity to erroneously reject a null-hypothesis.

True relationship: µ1 ≠ µ2 ≠ µ3 (and µ1 ≠ µ3). A false-positive is impossible in this case because all three groups are truly different. All follow-up tests necessarily keep the FWER at 0%!

There is no possible scenario where your FWER goes above 5%, so no need to correct for multiple comparisons! 
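The three scenarios can be checked by simulation. The sketch below is a stand-in, not the actual ANOVA-plus-t-test procedure: to stay within the Python standard library it assumes a known, equal variance in every group, so a chi-square omnibus test (2 degrees of freedom) replaces the F-test and pairwise z-tests replace the t-tests. The gating logic is the same. Group means, sample size, and the name `protected_fwer` are all invented for illustration.

```python
import math
import random
from statistics import NormalDist

def protected_fwer(mus, n=20, sigma=1.0, alpha=0.05, n_sims=20_000, seed=3):
    """Estimate the FWER of: omnibus test first, uncorrected pairwise tests only if it is significant.

    Known-variance simplification: chi-square omnibus and pairwise z-tests
    stand in for the ANOVA F-test and the follow-up t-tests.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se_mean = sigma / math.sqrt(n)        # standard error of one group mean
    se_diff = sigma * math.sqrt(2 / n)    # standard error of a difference of two means
    pairs = [(0, 1), (0, 2), (1, 2)]
    true_null_pairs = [(i, j) for i, j in pairs if mus[i] == mus[j]]
    global_null_true = len(true_null_pairs) == len(pairs)
    errors = 0
    for _ in range(n_sims):
        means = [rng.gauss(mu, se_mean) for mu in mus]
        grand_mean = sum(means) / len(means)
        q = sum((m - grand_mean) ** 2 for m in means) / se_mean ** 2
        p_omnibus = math.exp(-q / 2)      # chi-square survival function for 2 df
        if p_omnibus >= alpha:
            continue                      # omnibus not significant: no follow-ups are run
        # A significant omnibus is itself a false rejection when all means are equal.
        false_rejection = global_null_true
        for i, j in true_null_pairs:
            z = abs(means[i] - means[j]) / se_diff
            if 2 * (1 - std_normal.cdf(z)) < alpha:
                false_rejection = True
        if false_rejection:
            errors += 1
    return errors / n_sims

fwer_all_equal = protected_fwer((0.0, 0.0, 0.0))
fwer_one_differs = protected_fwer((1.0, 0.0, 0.0))
fwer_all_differ = protected_fwer((0.0, 1.0, 2.0))
print(f"all equal:     {fwer_all_equal:.3f}")
print(f"one stands out: {fwer_one_differs:.3f}")
print(f"all different: {fwer_all_differ:.3f}")
```

In the middle scenario the estimate should land at or a little under 5%, since a false positive requires both a significant omnibus test and a significant test on the one truly equal pair.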

So the next time Reviewer #2 gives you a hard time about correcting for multiple comparisons on a One-Way ANOVA with three levels, you can rightfully defend your uncorrected t-tests. Not correcting the alpha saves you some power, thereby making it easier to support your interesting findings.

If you wanted to sidestep the multiple comparison problem altogether you could do a fully Bayesian analysis, in which the number of tests conducted has no bearing on the evidence from a single test. In other words, you could jump straight to the comparisons of interest instead of doing the significant main effect → follow-up test routine. Wouldn’t that save us all a lot of hassle?

The Broken Ratchet

In a recent paper, Tennie and colleagues provide new data with regard to the concept of cumulative cultural learning. They set out to find evidence for a cultural “ratchet”, a mechanism by which one secures advantageous behavior seen in others, while simultaneously improving the behavior to become more efficient/productive. This is most commonly studied through diffusion chains, as is done here. The authors rounded up 80 four-year-olds (40 male, 40 female) and sorted them into chains of 5 kids each, leaving them with eight male and eight female chains. What follows is what I took away from this paper.

The kids’ task was simple: Try to fill a bucket with as much dry rice as possible. Two kids would be in the room at a time. Kids who completed their turn would swap out for kids who were new to the task, so that there was always 1 kid filling the bucket and 1 kid watching. The kids were given different tools they could potentially use (see their figure 1 below). Some tools were obviously better than others; their carrying capacities were: Bowl – 817.5g, Bucket – 439.7g, Scoop – 63.9g, Cardboard – 21.5g. In half of the chains, the first child saw an experimenter use the worst tool of the bunch (flimsy cardboard, circled in the figure) and the other half didn’t get a demonstration at all.

[Figure 1 from the paper: the tools available to the children]

As the authors said, “A main question of interest was whether children copied [Experimenter]’s and/or the previous child’s choice of tool or whether they innovated by introducing new tools”. In other words, evidence for a ratchet effect would manifest in later generations using more productive tools than the earlier generations. Another question is whether this innovation differed between conditions (those that had an experimenter demonstration and those that did not). Not sure why this manipulation is interesting, seeing as the only kids who see the experimenter perform the task are in Generation 1.

Without even going into the stats, I don’t see much evidence that kids are ratcheting. Most chains in the baseline show the following pattern: Generation 1 uses tool X and all subsequent generations use tool X. Two chains manage to break the imitation spell, both switching from scoop+bucket to scoop+bowl. The experimental group shows a similar pattern, where the kids either all copy generation 1 (who copied the experimenter) or one adventurous kid in the chain decides to switch tools and the rest copy him/her. Interestingly, the chains in the experimental group only ever switched from the cardboard to the scoop, effectively going from the worst tool to the second-worst tool. If these kids were trying to score the most rice, wouldn’t it be best to switch to the bucket or the bowl? Weird.

The authors propose that kids in baseline didn’t innovate across generations because they were already performing at a high level in generation 1, so they didn’t have room to grow. Well, the only chains that did actually innovate in baseline started with scoop + bucket (second highest capacity tool) and went to scoop + bowl (highest capacity tool). Further, the chains in the lowest starting position, scoop only, never innovated.

Overall I thought the experiment was cool. Rounding up 80 four-year-olds is not to be scoffed at. But I don’t agree with their claim that the baseline group was at ceiling and I don’t see much ratcheting in the experimental group (who all start with the worst tool).

Lack of Power (and not the statistical kind)

One thing that never really comes up when people talk about “Questionable Research Practices” is what to do when you’re a junior in the field and someone your senior suggests that you partake. [snip] It can be daunting to be the only one who thinks we shouldn’t drop 2 outliers to get our p-value from .08 to .01, or who thinks we shouldn’t go collect 5 more subjects to make it “work.” When it is 1 vs 4 and you’re at the bottom of the totem pole, it rarely works out the way you want. It is hard not to get defensive, and you desperately want everyone to just come around to your thinking, but it doesn’t happen. What can the little guy say to the behemoths staring him down?

I’ve recently been put in this situation, and I am finding it to be a challenge that I don’t know how to overcome. It is difficult to explain to someone that what they are suggesting you do is [questionable] (at least not without sounding accusatory). I can explain the problems with letting our post hoc p-value guide interpretation, or the problems for replicability when the analysis plan isn’t predetermined, or the problems with cherry picking outliers, but it’s really an ethical issue at its core. I don’t want to engage in what I know is a [questionable] practice, but I don’t have a choice. I can’t afford to burn bridges when those same bridges are the only things that get me over the water and into a job.

I’ve realized that this amazing movement in the field of psychology has left me feeling somewhat helpless. When push comes to shove, the one running the lab wins and I have to yield, even against my better judgment. After five months of data collection, am I supposed to just step away and not put my name on the work? There’s something to that, I suppose. A bit of poetic justice. But justice doesn’t get you into grad school, or get you a PhD, or get you a faculty job, or get you a grant, or get you tenure. The pressure is real for the ones at the bottom. I think more attention needs to be paid to this aspect of the psychology movement. I can’t be the only one who feels like I know what I should (and shouldn’t) be doing but don’t have a choice.

Edit: See another great point of view on this issue here http://jonathanramsay.com/questionable-research-practices-the-grad-student-perspective/

edit3: Changed some language