Can confidence intervals save psychology? Part 1

Maybe, but probably not by themselves. This post was inspired by Christian Jarrett’s recent post (you should go read it if you missed it) and the resulting Twitter discussion. This will likely develop into a series of posts on confidence intervals.

Geoff Cumming is a big proponent of replacing all hypothesis testing with CI reporting. He argues that the aim of research should be precise estimation of effects using confidence intervals, in order to facilitate future meta-analyses. But do we understand confidence intervals? (More estimation is something I can get behind, but I think there is still room for hypothesis testing.)

In the Twitter discussion, Ryne commented, “If 95% of my CIs contain Mu, then there is .95 prob this one does [emphasis mine]. How is that wrong?” It’s wrong for the same reason Bayesian advocates dislike frequency statistics: you cannot assign probabilities to single events or parameters in that framework. The .95 probability is a property of the long-run process of creating CIs; it is not associated with any given interval. That means you cannot make any probabilistic claim about this interval containing Mu or, likewise, about this particular hypothesis being true.

In the frequency statistics framework, all probabilities are long-run frequencies (i.e., the proportion of times an outcome occurs out of all possible related outcomes). As such, all statements about associated probabilities must be of that nature. If a fair coin has an associated probability of 50% heads, and I flip it very many times, then in the long run the proportion of heads will approach one half. For any given next flip there is no associated probability of heads. This flip is either heads (p(H) = 1) or tails (p(H) = 0), and we don’t know which until after we flip.¹ By assigning probabilities to single events, the sense of a long-run frequency is lost (i.e., one flip is not a collective of all flips). As von Mises puts it:

Our probability theory [frequency statistics] has nothing to do with questions such as: “Is there a probability of Germany being at some time in the future involved in a war with Liberia?” (von Mises, 1957, p. 9, quoted in Oakes, 1986, p. 16)

This is why Ryne’s statement was wrong, and this is why there can be no statements of the kind, “X is the probability that these results are due to chance,”² or “There is a 50% chance that the next flip will be heads,” or “This hypothesis is probably false,” when one adopts the frequency statistics framework. All probabilities are long-run frequencies in a relevant “collective.” (Have I beaten this horse to death yet?) It’s counter-intuitive and strange that we cannot speak of any single event or parameter’s probability. But sadly we can’t in this framework, and as such, “There is .95 probability that Mu is captured by this CI,” is a vacuous statement. If you want to assign probabilities to single events and parameters come join us over in Bayesianland (we have cookies).
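To make the long-run reading concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and not something from the Twitter discussion: a normal population with a known SD, the values mu = 100, sigma = 15, n = 25, and the helper names `z_interval` and `simulate_coverage`. It repeats the same experiment many times, builds a 95% interval each time, and checks whether the interval captured the fixed Mu.

```python
import random
import statistics


def z_interval(sample, sigma, z=1.96):
    """95% CI for the mean, assuming the population SD (sigma) is known."""
    mean = statistics.fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return mean - half_width, mean + half_width


def simulate_coverage(mu=100.0, sigma=15.0, n=25, n_experiments=10_000, seed=1):
    """Repeat the experiment and record, per interval, whether it captured mu."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = []
    for _ in range(n_experiments):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        hits.append(low <= mu <= high)  # a single interval either captures mu or not
    return hits


hits = simulate_coverage()
coverage = sum(hits) / len(hits)  # property of the whole collective of intervals
print(f"Long-run coverage: {coverage:.3f}")
print(f"Outcomes seen for individual intervals: {sorted(set(hits))}")
```

The coverage figure should land close to .95, but it belongs to the collective of ten thousand intervals. Each entry in `hits` is only ever `True` or `False`, which is the frequentist's whole point about any one interval.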

EDIT 11/17: See Ryne’s post for why he rejects the technical definition for a pragmatic definition.


¹But don’t tell Daryl Bem that.

²Often a confused interpretation of the p-value. The correct interpretation is subtly different: “The probability of the obtained (or more extreme) results given chance.” “Given” is the key difference, because here you are assuming chance. How can an analysis assuming chance is true (i.e., p(chance) = 1) lead to a probability statement about chance being false?


Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 0956797613504966.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. New York: Wiley.

11 thoughts on “Can confidence intervals save psychology? Part 1”

  1. If an experiment, in the long run, results in the event A 50% of the time, then this is what “the probability of A in a single experiment is 0.5” means in a frequentist setting. It is simply a layer of interpretation.

    This seems quite clear. How would you describe what the probabilities “really mean” in a Bayesian setting? (Is it anything that gives anyone a better chance to understand the outcome of a single experiment than the frequentist explanation?)

    • Thanks for your comment, though I’m not sure I follow your layer of interpretation. Probability in the frequency framework is simple counting. We conduct infinite repetitions of our experiment, count the As and not-As, and divide by the total. If in the end half of our experiments come out with A and half come out with not-A, the probability associated with A (number of times A out of total times [A plus not-A]) is .50. But if my total number of experiments is 1, i.e. my collective is a single experiment, then my proportion can only be an integer because my counts are either A (1) or not-A (0). So my proportion A out of total (A plus not-A) is limited to either 1/1 or 0/1.

      You ask what Bayes defines probabilities as, and it is much more intuitive: degree of plausibility, or degree of belief. The degrees of belief are not limited to counts and the parameters are not fixed. When one calculates a credible interval, or highest density interval, it means what almost everyone thinks confidence intervals mean: The probability is .95 that mu is contained in this range (x, y).

      • I’m saying what a frequentist interpretation of the statement “A has probability .5 (in a single experiment)” means. The layer of interpretation is to translate that sentence to “we are using a model that has A occur 50% of the time in the long run”. I disagree with your view that we cannot speak of probabilities for single events (as long as we understand what we mean).

        To say that probabilities are “degrees of plausibility” seems to say nothing of the world UNLESS you specify something like: in the long run, an event with more degrees of plausibility will occur more often than an event with less degree of plausibility. And then you are very close to a frequentist explanation.

        For confidence intervals – the main topic of your post – the question is more vague. The most correct thing to say would be along the lines of “this interval is an observation of a stochastic interval which has probability .95 to cover the parameter”. That seems logically equivalent to (although the emphasis is changed) “this interval is an observation of a stochastic interval which the parameter is in with probability .95”. If you think we shouldn’t say that “the parameter is in the OBSERVED interval with prob .95”, then ok, since we are no longer talking about the stochastic interval, NOT because it is a single event/experiment.

        • Thanks for your clarification. If you are addressing the probability of the stochastic process of obtaining your interval and having it capture a fixed mu, then fine, call it .95. But you are assigning that probability to the process’s relative frequency. “The process behaves as such and such in the long run” is surely the correct way to describe the relative frequency properties of the process, because it correctly refers to the behavior with regard to the referenced collective. “As long as we understand what they mean” is right. People don’t understand what it means. And they don’t usually mean to say anything about the process, and they surely don’t translate it into the type of phrase you wrote. If they did we wouldn’t have any problems.

          The point I was making is exactly the one you mention at the end. The question Ryne had was why he cannot assign .95 to _his_ particular interval, and the answer is because it does not reference a long run relative frequency. There is no long run for _this_ interval.

          So the confusion seems to have come in when I mention single experiments. Sure, the single experiment can be generated through a process that behaves some way when repeated so many times. And we can talk about and report that process’s behavior as our probability. But the behavior of the stochastic process is not the behavior of the event it generates, which you say at the end of your comment (and on which we seem to agree). Ryne rejects this, and thinks we should be able to assign a probability of capturing a fixed mu to this interval. I also reject it, but I don’t subscribe to frequency statistics.

  2. Anyway, thanks for the post.

    I came to statistics from math, so I mainly think of probability as “a measure”. Being involved these days in teaching statistics to non-mathematicians, I sometimes necessarily need to simplify concepts, but I am always wary of how that is done.
