Optional stopping does not affect the interpretation of posterior odds. Even with optional stopping, a researcher can interpret the posterior odds as updated beliefs about hypotheses in light of data.

–Rouder, 2014

*Sunday Bayes*

The format of this series is short and simple: Every week I will give a quick summary of a paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere.

### Optional stopping: No problem for Bayesians

Bayesian analysts use probability to express a degree of belief. For a flipped coin, a probability of 3/4 means that the analyst believes it is three times more likely that the coin will land heads than tails. Such a conceptualization is very convenient in science, where researchers hold beliefs about the plausibility of theories, hypotheses, and models that may be updated as new data become available. (p. 302)

It is becoming increasingly common to evaluate statistical procedures by way of simulation. Instead of doing formal analyses, we can use flexible simulations to tune many different parameters and immediately see the effect they have on the behavior of a procedure.

Simulation results have a tangible, experimental feel; moreover, if something is true mathematically, we should be able to see it in simulation as well. (p. 303)

But this brings with it a danger that the simulations performed might be doing the wrong thing, and unless we have a good grasp of the theoretical background of what is being simulated, we can easily be misled. In this paper, Rouder (2014) shows that common intuitions we have for evaluating simulations of frequentist statistics often do not translate to simulations of Bayesian statistics.

The critical element addressed here is whether optional stopping is problematic for Bayesians. My argument is that both sets of authors use the wrong criteria or lens to draw their conclusions. They evaluate and interpret Bayesian statistics as if they were frequentist statistics. The more germane question is whether Bayesian statistics are interpretable as Bayesian statistics even if data are collected under optional stopping. (p. 302)

When we evaluate a frequentist procedure via simulation, it is common to set a parameter to a certain value and evaluate the number of times certain outcomes occur. For example, we can set the difference between two group means to zero, simulate a bunch of p values, and see how many fall below .05. Then we can set the difference to some nonzero number, simulate a bunch of p values, and again see how many are below .05. The first gives you the type-1 error rate for the procedure, and the second gives you the statistical power. This is appropriate for frequentist procedures because the probabilities calculated are always conditional on one or the other hypothesis being true.
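The frequentist recipe just described can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine, not Rouder's: a z test with known standard deviation of 1 stands in for whatever test one would really use, and the group size, the nonzero difference of 0.8, and the function names are hypothetical choices.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided p value from a z test (known sd, equal group sizes assumed)."""
    n = len(group_a)
    se = sd * (2.0 / n) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(true_diff, n_per_group=20, n_sims=5000, alpha=0.05, seed=1):
    """Fix the true difference, simulate many experiments, count p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if two_sided_p(a, b) < alpha:
            rejections += 1
    return rejections / n_sims


# Conditioning on "the null is true" gives the type-1 error rate;
# conditioning on "the difference is 0.8" gives the power.
type1_error = rejection_rate(true_diff=0.0)
power = rejection_rate(true_diff=0.8)
print(f"type-1 error rate: {type1_error:.3f}")
print(f"power at diff=0.8: {power:.3f}")
```

Note that both numbers are computed by fixing the truth first and counting outcomes second, which is exactly the habit that does not carry over to Bayes factors.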

One might be tempted to evaluate Bayes factors in the same way; that is, set the difference between two groups to zero and see how many BFs are above some threshold, and then set the difference to something nonzero and see how many BFs are again above some threshold.

The critical error … is studying Bayesian updating conditional on some hypothetical truth rather than conditional on data. This error is easy to make because it is what we have been taught and grown familiar with in our frequentist training. (p. 308)

Evaluating simulations of Bayes factors in this way is incorrect. Bayes factors (and posterior odds) are conditional on only the data observed. In other words, the appropriate evaluation is: “Given that I have observed this data (i.e., BF = x), what is the probability the BF was generated by H1 vs H0?”

Rouder visualizes this as follows. Flip a coin to choose the true hypothesis, then simulate a Bayes factor, and repeat these two steps many, many times. At the end of the simulation, whenever BF=x is observed, check how many of these came from one model vs the other. The simulation shows that in this scenario, if we look at all the times BF=3 is observed, there will be 3 BFs from the true model to every 1 BF from the false model. Since the prior odds are 1 to 1, the posterior odds equal the Bayes factor.
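Here is a rough sketch of that coin-flip simulation with a fixed sample size. The specific models are my assumptions for illustration and are not necessarily the ones Rouder used: data are normal with known variance 1, H0 says the mean is 0, and H1 puts a N(0, TAU^2) prior on the mean, which gives a closed-form Bayes factor. When H1 wins the coin flip, the true mean is drawn from that prior.

```python
import math
import random

TAU = 1.0  # assumed prior sd of the mean under H1 (hypothetical choice)
N = 10     # assumed fixed sample size


def bf10(sample_sum, n, tau=TAU):
    """BF for H1: mu ~ N(0, tau^2) vs H0: mu = 0, with data ~ N(mu, 1)."""
    v = 1.0 + n * tau**2
    return math.exp(-0.5 * math.log(v) + 0.5 * tau**2 * sample_sum**2 / v)


def simulate(n_sims=200_000, seed=2014):
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        h1_true = rng.random() < 0.5                 # coin flip picks the truth
        mu = rng.gauss(0.0, TAU) if h1_true else 0.0
        # The sum of N unit-variance observations is N(N * mu, N).
        s = rng.gauss(N * mu, math.sqrt(N))
        results.append((h1_true, bf10(s, N)))
    return results


results = simulate()

# Condition on the observed BF (a bin around 3), then ask where it came from.
near_3 = [h1_true for h1_true, bf in results if 2.5 <= bf <= 3.5]
from_h1 = sum(near_3)
from_h0 = len(near_3) - from_h1
ratio_near_3 = from_h1 / from_h0
print(f"BF10 near 3: {from_h1} from H1, {from_h0} from H0, "
      f"ratio = {ratio_near_3:.2f}")
```

Because the bin has some width, the ratio should land somewhere near 3 rather than exactly on it; a narrower bin with more simulations tightens it up.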

You can see in the figure above (taken from Rouder’s figure 2), the distribution of Bayes factors observed when the null is true (purple, projected downwards) vs when the alternative is true (pink, projected upwards). Remember, the true hypothesis was chosen by coin flip. You can clearly see that when a BF of 3 to 1 in favor of the null is observed, the purple column is three times bigger than the pink column (shown with the arrows).

Below (taken from Rouder’s figure 2) you see what happens when one employs optional stopping (e.g., flip a coin to pick the underlying true model, then sample until the BF favors one model over the other by at least 10 or you reach a maximum n). The distribution of Bayes factors generated by each model becomes highly skewed, which is often taken as evidence that conclusions drawn from Bayes factors depend on the stopping rule. The incorrect interpretation would be: Given the null is true, the number of times I find BF=x in favor of the alternative (i.e., in favor of the wrong model) has gone up, therefore the BF is sensitive to optional stopping. This is incorrect because it conditions on one model being true and checks the number of times a BF is observed, rather than conditioning on the observed BF and checking how often it came from H0 vs. H1.

Look again at what matters: What is the ratio of observed BFs that come from H1 vs. H0 *for a given BF*? No matter what stopping rule is used, the answer is always the same: If the true hypothesis is chosen by a coin flip, and a BF of 10 in favor of the alternative is observed, there will be 10 times as many observed BFs in the alternative column (pink) as in the null column (purple).
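The same check can be tried under optional stopping. The sketch below is a toy version under my own assumptions, not a reproduction of Rouder's simulation: normal data with known variance 1, H0 with mean 0, H1 with a N(0, TAU^2) prior on the mean, and a rule that stops once the BF reaches 10 either way or the sample hits a hypothetical cap of 200. Within each group of stopped runs, it compares the share that truly came from H1 with the share the Bayes factors themselves predict, BF / (1 + BF) at 1-to-1 prior odds.

```python
import math
import random

TAU = 1.0          # assumed prior sd of the mean under H1 (hypothetical)
THRESHOLD = 10.0   # stop once BF10 >= 10 or BF10 <= 1/10
MAX_N = 200        # assumed cap on the sample size
LOG_THR = math.log(THRESHOLD)


def log_bf10(sample_sum, n, tau=TAU):
    """Log BF for H1: mu ~ N(0, tau^2) vs H0: mu = 0, with data ~ N(mu, 1)."""
    v = 1.0 + n * tau**2
    return -0.5 * math.log(v) + 0.5 * tau**2 * sample_sum**2 / v


def run_until_stop(mu, rng):
    """Add one observation at a time; peek at the BF after every one."""
    s = 0.0
    lbf = 0.0
    for n in range(1, MAX_N + 1):
        s += rng.gauss(mu, 1.0)
        lbf = log_bf10(s, n)
        if abs(lbf) >= LOG_THR:
            break
    return lbf


def calibration(n_sims=10_000, seed=308):
    rng = random.Random(seed)
    bins = {"BF10 >= 10": [], "BF10 <= 1/10": [], "in between": []}
    for _ in range(n_sims):
        h1_true = rng.random() < 0.5                 # coin flip picks the truth
        mu = rng.gauss(0.0, TAU) if h1_true else 0.0
        lbf = run_until_stop(mu, rng)
        if lbf >= LOG_THR:
            key = "BF10 >= 10"
        elif lbf <= -LOG_THR:
            key = "BF10 <= 1/10"
        else:
            key = "in between"
        post_h1 = 1.0 / (1.0 + math.exp(-lbf))       # BF / (1 + BF)
        bins[key].append((h1_true, post_h1))
    summary = {}
    for key, runs in bins.items():
        if runs:
            observed = sum(h1 for h1, _ in runs) / len(runs)
            predicted = sum(p for _, p in runs) / len(runs)
            summary[key] = (len(runs), observed, predicted)
    return summary


summary = calibration()
for key, (n_runs, observed, predicted) in summary.items():
    print(f"{key:>13}: n={n_runs:5d}  share truly from H1={observed:.3f}  "
          f"share implied by the BFs={predicted:.3f}")
```

If the two shares agree up to simulation noise, the Bayes factors are still saying what they claim to say, even though every run peeked at the data after each observation.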

In Rouder’s simulations he always used prior odds of 1 to 1, because then the posterior odds equal the Bayes factor. If one were to change the prior odds then the Bayes factor would no longer equal the posterior odds, and the shape of the distribution would again change. Importantly, while the *absolute number* of Bayes factors that end up in each bin would change, the *ratio* of each pink column to its purple column would still track the evidence: it would equal the Bayes factor multiplied by the prior odds, which is exactly the posterior odds. No matter what stopping rule you use, the conclusions we draw from Bayes factors and posterior odds are unaffected.

Feel free to employ any stopping rule you wish.

This result was recently shown again by Deng, Lu, and Chen in a paper posted to arXiv using similar simulations, and they go further by proving it as a theorem.

### A few choice quotes

Page 308:

Optional-stopping protocols may be hybrids where sampling occurs until the Bayes factor reaches a certain level or a certain number of samples is reached. Such an approach strikes me as justifiable and reasonable, perhaps with the caveat that such protocols be made explicit before data collection. The benefit of this approach is that more resources may be devoted to more ambiguous experiments than to clear ones.

Page 308:

The critical error … is studying Bayesian updating conditional on some hypothetical truth rather than conditional on data. This error is easy to make because it is what we have been taught and grown familiar with in our frequentist training. In my opinion, the key to understanding Bayesian analysis is to focus on the degree of belief for considered models, which need not and should not be calibrated relative to some hypothetical truth.

Page 306-307:

When we update relative beliefs about two models, we make an implicit assumption that they are worthy of our consideration. Under this assumption, the beliefs may be updated regardless of the stopping rule. In this case, the models are dramatically wrong, so much so that the posterior odds contain no useful information whatsoever. Perhaps the more important insight is not that optional stopping is undesirable, but that the meaningfulness of posterior odds is a function of the usefulness of the models being compared.

Great blog post (and great series of blog posts)!

I like Jeff’s paper, and agree that his p. 308 quote is one of the fundamental issues that leads to confusion when comparing Bayesian and frequentist methods. Having said that, there’s one quirk that I just can’t wrap my head around in the Bayesian optional stopping routine. Plots such as Figure 5 in the Rouder et al. (2009) t-test paper show that when the generating effect size is small, we can end up with a non-monotonic function relating BF to sample size. This means that if we set one smaller threshold we will on average prefer the null hypothesis, but if we wait longer we will prefer the alternative hypothesis. This seems to be in the spirit of Jeff’s paper (after a smaller number of samples, the data are indeed more consistent with the null hypothesis, given the prior on effect size), but it’s not so clear why monotonicity isn’t an assumption in optional stopping.

Hi Simon,

The nonmonotonicity is actually required by any reasonable quantification of evidence. See my blog post here: ‘All about that “bias, bias, bias” (it’s no trouble)’.

Hi Alex,

Great post as usual. One point in the paper that you linked struck me. Since any optional stopping will result in BFs that retain their proper interpretation, this would imply that a defensible research practice would be to use optional stopping until p < .05 and then simply report a BF. A strange thought, don't you think?

Thanks, Felix. I am not sure why someone would do sampling like that, considering the longer they chase p<.05, the worse the BF will be for them when they reach it!

Nobody would (I think) use this sampling plan – I thought it was just peculiar. Under NHST we would think that this behavior would be unacceptable, yet just switching to reporting BF would somehow make this OK.
