Can confidence intervals save psychology? Part 2

This is part 2 in a series about confidence intervals (here’s part 1). My goal is not really to answer the question in the title but simply to discuss confidence intervals and their pros and cons. The last post explained why frequency statistics (and confidence intervals) can’t assign probabilities to one-time events; their probabilities always refer to a collective of long-run events.

If confidence intervals don’t really tell us what we want to know, does that mean we should throw them in the dumpster along with our p-values? No, for a simple reason: In the long run we will make fewer errors with confidence intervals (CIs) than we will with p. Eventually we may want to drop CIs for more nuanced inference, but for the time being we would do much better with this simple switch.

If we calculate CIs for every (confirmatory) experiment we ever run, roughly 95% of our CIs will hit the mark (i.e., contain the true population mean). Can we ever know which ones? Tragically, no. But some would feel pretty good about the process being used if it only has a 5% lifetime error rate. One could achieve a lower error rate by stretching the intervals (to, say, 99%), but that would leave them too embarrassingly wide for most.
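To make the long-run claim concrete, here is a minimal simulation sketch in Python. It assumes normally distributed data with a known standard deviation (so a simple z-interval applies). The function name `ci_coverage` and the values of `mu`, `sigma`, and `n` are made up for illustration and do not come from any real study.

```python
import random
from statistics import NormalDist


def ci_coverage(n_experiments=10_000, n=30, mu=100.0, sigma=15.0,
                level=0.95, seed=1):
    """Proportion of simulated z-intervals that contain the true mean mu.

    Assumption: sigma is known, so every interval has the same half-width.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(n_experiments):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Any single interval either contains mu or it does not.
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / n_experiments


print(ci_coverage(level=0.95))  # long-run proportion of 95% CIs that hit
print(ci_coverage(level=0.99))  # same process with stretched intervals
```

Under these assumptions the first proportion should land near .95 and the second near .99, with the 99% intervals about 31% wider than the 95% ones (the ratio of the two critical values, roughly 2.576 / 1.960). Note that the simulation can only score each interval because it knows `mu`, which a real experimenter never does.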

If we use p we will be wrong 5% of the time in the long run when we are testing a true null hypothesis (i.e., no association between variables, or no difference between means, etc., and assuming the analysis is 100% pre-planned). But when we are testing a false null hypothesis we will be wrong (i.e., fail to reject it) roughly 40-50% of the time or more in the long run (Button et al., 2013; Cohen, 1962; Sedlmeier & Gigerenzer, 1989). If you are one of the many who do not believe a null hypothesis can actually be true, then we are always in the latter scenario with that huge error rate. In many cases (i.e., studying smallish and noisy effects, like most of psychology) we would literally be better off flipping a coin and declaring our result “significant” whenever it lands heads.
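Here is a rough sketch of those two error rates, again in Python. The setup is an assumption chosen purely for illustration: two groups of 30, a standard deviation of 1 treated as known (so a z-test applies), and a true standardized difference of 0.5 in the false-null case. The name `nonsignificant_rate` is hypothetical.

```python
import random
from statistics import NormalDist


def nonsignificant_rate(d=0.5, n_per_group=30, alpha=0.05,
                        n_experiments=10_000, seed=1):
    """Proportion of simulated two-group experiments with p >= alpha.

    d is the true difference between group means in SD units (sigma = 1).
    With d > 0 the null is false, so every nonsignificant result is a miss.
    With d = 0 the null is true, so 1 minus this value is the false-positive rate.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # SE of the difference between two means
    nonsignificant = 0
    for _ in range(n_experiments):
        mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(d, 1.0) for _ in range(n_per_group)) / n_per_group
        if abs((mean_b - mean_a) / se) < z_crit:
            nonsignificant += 1
    return nonsignificant / n_experiments


print(1 - nonsignificant_rate(d=0.0))  # error rate when the null is true
print(nonsignificant_rate(d=0.5))      # error rate when the null is false
```

With these made-up numbers the first rate should sit near .05 and the second near one half, which is the coin-flip territory described above. Smaller effects or smaller samples push the second rate higher still.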

There is a limitation to this benefit of CIs, and this limitation is self-imposed. We cannot escape the monstrous error rates associated with p if we report CIs but then interpret them as if they are significance tests (i.e., reject if the null value falls outside the interval). Switching to confidence intervals will do nothing if we use them as a proxy for p. So the question then becomes: Do people actually interpret CIs simply as a null-hypothesis significance test? Yes, unfortunately they do (Coulson et al., 2010).

References

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but don’t guarantee, better inference than statistical significance testing. Frontiers in Psychology, 1, 26.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309.

http://datacolada.org/2014/10/08/28-confidence-intervals-dont-change-how-we-think-about-data/

Can confidence intervals save psychology? Part 1

Maybe, but probably not by themselves. This post was inspired by Christian Jarrett’s recent post (you should go read it if you missed it), and the resulting Twitter discussion. This will likely develop into a series of posts on confidence intervals.

Geoff Cumming is a big proponent of replacing all hypothesis testing with CI reporting. He says we should change the goal to be precise estimation of effects using confidence intervals, with a goal of facilitating future meta-analyses. But do we understand confidence intervals? (More estimation is something I can get behind, but I think there is still room for hypothesis testing.)

In the Twitter discussion, Ryne commented, “If 95% of my CIs contain Mu, then there is .95 prob this one does [emphasis mine]. How is that wrong?” It’s wrong for the same reason Bayesian advocates dislike frequency statistics: you cannot assign probabilities to single events or parameters in that framework. The .95 probability is a property of the process of creating CIs in the long run; it is not associated with any given interval. That means you cannot make any probabilistic claims about this interval containing Mu, or, by the same token, about this particular hypothesis being true.

In the frequency statistics framework, all probabilities are long-run frequencies (i.e., a proportion of times an outcome occurs out of all possible related outcomes). As such, all statements about associated probabilities must be of that nature. If a fair coin has an associated probability of 50% heads, and I flip that coin very many times, then in the long run I will obtain roughly half heads and half tails. In any given next flip there is no associated probability of heads. This flip is either heads (p(H) = 1) or tails (p(H) = 0) and we don’t know which until after we flip.¹ By assigning probabilities to single events the sense of a long-run frequency is lost (i.e., one flip is not a collective of all flips). As von Mises puts it:

Our probability theory [frequency statistics] has nothing to do with questions such as: “Is there a probability of Germany being at some time in the future involved in a war with Liberia?” (von Mises, 1957, p. 9, quoted in Oakes, 1986, p. 16)

This is why Ryne’s statement was wrong, and this is why there can be no statements of the kind, “X is the probability that these results are due to chance,”² or “There is a 50% chance that the next flip will be heads,” or “This hypothesis is probably false,” when one adopts the frequency statistics framework. All probabilities are long-run frequencies in a relevant “collective.” (Have I beaten this horse to death yet?) It’s counter-intuitive and strange that we cannot speak of any single event or parameter’s probability. But sadly we can’t in this framework, and as such, “There is .95 probability that Mu is captured by this CI,” is a vacuous statement. If you want to assign probabilities to single events and parameters come join us over in Bayesianland (we have cookies).
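The coin example can be sketched in a few lines of Python. This is only an illustration: a pseudo-random number generator stands in for the fair coin, and `running_frequency` is a hypothetical name.

```python
import random


def running_frequency(n_flips=100_000, p_heads=0.5, seed=1):
    """Relative frequency of heads after 1, 10, 100, ... flips."""
    rng = random.Random(seed)
    checkpoints = (1, 10, 100, 1_000, 10_000, 100_000)
    heads = 0
    frequencies = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < p_heads  # each flip is a plain 0 or 1
        if i in checkpoints:
            frequencies[i] = heads / i
    return frequencies


for flips, freq in running_frequency().items():
    print(f"{flips:>7} flips: proportion of heads = {freq:.4f}")
```

After one flip the proportion is exactly 0 or 1. Only as the collective of flips grows does the relative frequency settle near .5, and in the frequency framework that is the only sense in which the “50%” applies.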

EDIT 11/17: See Ryne’s post for why he rejects the technical definition for a pragmatic definition.

Notes:

¹But don’t tell Daryl Bem that.

²Often a confused interpretation of the p-value. The correct interpretation is subtly different: “The probability of the obtained (or more extreme) results given chance.” “Given” is the key difference, because here you are assuming chance. How can an analysis assuming chance is true (i.e., p(chance) = 1) lead to a probability statement about chance being false?

References:

Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 0956797613504966.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. New York: Wiley.