Statistical paradoxes and omnibus tests

I’ve been thinking about statistical paradoxes lately. They crop up all over the place, and they make for fun little puzzles. In this blog I’ll talk about two paradoxes that can occur when doing omnibus statistical tests. Relevant code is attached as a gist at the bottom of the page.

1) A common paradox

There’s a pretty common –and annoying!– statistical phenomenon that most people are probably familiar with when testing whether multiple means are equal to each other (or to some specific value). It is not uncommon that one runs an overall omnibus test and obtains a significant result, allowing for rejection of the overall null hypothesis of equality. However, somewhat paradoxically(?), at the same time none of the followup tests on the individual group means/differences show a significant effect (even without multiplicity correction). So, with mild frustration, we can claim that there’s some difference among the groups but we cannot pinpoint where.

This “paradox” of course is totally understandable. The overall omnibus test looks at all the group deviations at once, so it is possible for the model’s overall deviation to be large enough that we can reject the omnibus null, even if none of the groups show a particularly large deviation themselves. In other words, small deviations among the groups add up to a “large” overall deviation. The most obvious case is the cross-over interaction, with subgroup means showing a1 – a2 > 0 and b1 – b2 < 0. Because they go in opposite directions, the difference between (a1-a2) and (b1-b2) can be large while neither individual difference is itself very large.

After some careful consideration, I’d bet most people come to see this kind of omnibus test behavior to be perfectly reasonable, and not actually a paradox.

But it is simple enough to simulate, so let’s do that and try to “see” the paradox.

Consider for simplicity testing only two group means, where the omnibus null hypothesis is that they are both equal to zero. Assume also for simplicity that the data from the two groups have known standard deviations, so that we can safely use z-values (rather than t-values). If the group means are independent, then to generate their sampling distribution we can simply generate pairs of z-values from standard normal distributions with mean zero and standard deviation one.

A large number of simulated pairs of z-values are shown below with dots. Because they are independent, the pairs scatter around the origin (0,0) in all directions equally, with most pairs relatively close to the origin (the red cross).


Any pair of observable z-values lives somewhere on this plane, with their address given by their coordinates (z1, z2). A natural discrepancy measure in this scenario between the observed pair and the null hypothesis is the distance from the origin to the pair. This is merely the length of the line connecting (0,0) to the point (z1, z2) — i.e., Euclidean distance — which is the hypotenuse of a right triangle with base z1 and height z2. The Pythagorean theorem tells us this length is given by D=√(z1² + z2²).  Hence, we would reject the omnibus null only if the observed pair of z-values lives “far enough” away from the middle point of this cloud.

For example, the arrow starts at the origin and points to the pair (1.75, 1.85). The length of this arrow is √(1.75² + 1.85²) ≈ 2.55, i.e. the Euclidean distance from the null hypothesis (0,0) to the point (1.75, 1.85) is about 2.55.

If we call the sum of two squared independent z-values D² (i.e., the squared length from the origin to the pair), this D²-value follows a chi-square distribution with two degrees of freedom (with n tests it has df=n). The omnibus test is then significant only if D² is larger than the 95th percentile of the reference chi-square distribution, which is the value 5.991 for df=2. This region is shown with the circle. Observing any pairs of z-values inside the circle does not lead to rejection of the omnibus null, whereas observing pairs located outside the circle does lead to rejection. As we would expect, about 95% of the simulated pairs live inside the circle, and 5% live outside the circle.

The squared distance from the origin to the pair (1.75, 1.85) is 2.55², or roughly 6.45. This D² is larger than the critical value 5.991, so the pair (1.75, 1.85) would lead us to reject the overall null that both means are zero.

The grey square marks the area where neither individual test would reject its null, so z-pairs outside the box would result in at least one of the individual nulls being rejected. If we only looked at the z1 coordinate, we would reject the null for that test when z1 falls outside the vertical lines, and the same goes for z2 and the horizontal lines. The pair (1.75, 1.85) would therefore not lead to rejection of either test’s individual null hypothesis because it lives inside the box.

Thus, the zone where this “paradox” occurs is anywhere we can observe a pair of z-values that falls outside the circle but inside the box. That is, the paradox occurs in the four inside corners of the grey box. In this case with two groups, that area is quite small — it happens about .25% of the time if the omnibus null is true.

Fun little side result: As you increase the number of independent tests, so not just two tests but n -> infinity tests, the probability of observing all z-values in this “paradox” zone actually approaches 0%. Heuristically, we’d expect that with enough tests, at least one of them will eventually, by chance, get a z-value bigger than 1.96.

The proof is pretty easy. Consider that for any number of tests n the paradox zone is a subset of the inner box (i.e., the “n-cube” in n dimensions) that has sides of length 2*1.96. Then, the probability this paradox occurs is bounded from above by the probability that all n individual z-values are less than 1.96 in absolute value (i.e., that they live inside the n-cube). That probability is .95n when the omnibus null is true, which clearly goes to zero as n increases. Of course, that probability is just an upper bound; the actual probability of the paradox zone can get really small really fast!

2) Rao’s little-known paradox

There is another paradox that can occur when doing omnibus tests, which is less widely known. But I think this one is much harder to resolve, even with drawing a picture! Rao’s paradox is basically the opposite of the previous paradox: The omnibus test of the overall null is non-significant, meaning we cannot reject the null hypothesis that all groups have zero mean (or equal means, etc.). But at the same time, all individual tests show a significant effect, so we can reject each group’s individual null hypothesis.

Kinda freaky right? (Maybe I should have posted this blog on halloween!…)

Rao’s paradox can occur when your tests are not independent. Imagine that you have participants come into the lab and you give them multiple tasks that measure the same general construct. Then it is not inconceivable that people high on the construct tend to score high on both tasks, and people low on the construct tend to score low on both tasks. This would naturally induce a positive correlation between the two sets of scores, and thus between their test statistics. Rao’s paradox could then occur, where you reject the null using task 1, reject the null using task 2, but fail to reject the joint null using an omnibus test.

Consider again the case of two testing two group means, but now assume the two z-values are correlated at r=.5. Now we can generate pairs of z-values from a bivariate normal distribution with means of zero, standard deviations of 1, but correlation of .5. The sampling distribution of these correlated pairs of z-values is shown below. Compared to the previous example, this sampling distribution is sort of slanted and elongated at the upper-right and lower-left corners due to the correlation. I’ve also extended the edges of the grey box out a little bit for a reason that will make sense soon.


In this case the omnibus null hypothesis is rejected whenever a pair of z-values falls outside the ellipse.  Our test statistic is still a function of the distance between the pair of z-values and the origin, but now we also need to know which direction the z-pair lies in relation to the general orientation of the sampling distribution. Clearly, some pairs of z-values that are close to the origin live outside the ellipse (northwest and southeast directions) and some that are far away live inside the ellipse (northeast and southwest directions).

Instead of the squared Euclidean distance that we used last time, now we will use the squared Mahalanobis distance as our test statistic. The Mahalanobis distance is essentially a generalization of Euclidean distance, to account for the direction and scale of the sampling distribution. In the case of two correlated z-tests, the squared Mahalanobis distance is D² = (1-r²)-1(z1² – 2rz1z2 + z2²), which once again follows a chi-square distribution with 2 degrees of freedom. So once again we reject the omnibus null if D² is larger than 5.991.

Take for example the observe pair (2.05, 2.05). When r=.5, this pair achieves a Mahalanobis distance of D² = 5.60, which is not larger than 5.99 and hence not significant. Thus, we would not reject the omnibus null. However, both z-values alone would normally be considered significant (two-tailed p = .04) and we would reject each individual null hypothesis.

From the picture, we see that any pair that lands outside the inner square, but inside the ellipse, will lead to this paradox occurring. That is, the upper right or lower left outside corners of the box.

As the correlation between tests gets larger, the sampling distribution gets stretched farther and farther in the diagonal direction. Thus, as r gets bigger we can observe larger and larger pairs of z-values (up to a limit) without rejecting the omnibus null. For instance, the sampling distribution for r=.8 is shown below. Even observing the pair (2.25, 2.25) would not reject the omnibus null, despite each individual test obtaining p=.024.


To recap: Rao’s paradox can happen when you have correlated test statistics, which actually can happen a lot. For example, consider a simple linear regression with an intercept and slope parameter. If you do not center your predictor then your sampling distribution of the intercept and slope will very likely be correlated, possibly quite highly!

Are there other cool paradoxes?

If you read this far and know of other statistical “paradoxes” that can happen with omnibus tests I’d love to hear about them (via a comment, twitter, etc.). Also let me know if you would like to see more blogs about other statistical paradoxes, not necessarily just ones related to omnibus tests but just other interesting paradoxes in general. They are pretty fun little puzzles!


Code for the figures:

Some Technical Notes on Kullback-Leibler Divergence

TLDR: I typed up some of my technical notes where I derive the Kullback-Leibler divergence for some common distributions. Find them here on PsyArXiv.

The Kullback-Leibler (KL) divergence is a concept that arises pretty frequently across many different areas of statistics. I recently found myself needing to use the KL divergence for a particular Bayesian application, so I hit up google to find resources on it. The wikipedia page is not exactly … hmm how should I say this … friendly? Fortunately, there are a few nice tutorials online explaining the general concept, such as this, or this, or this (they are all nice but the statistician in me seems to prefer the third link).

Essentially, if we have two competing distributions/models that could have generated the data, the KL divergence gives us the expected log likelihood ratio in favor of the true distribution. (A refresher on likelihoods and likelihood ratios is here.) The log likelihood ratio can be interpreted as the amount of evidence the data provide for one model versus another, so the KL divergence tells us how much evidence we can expect our data to provide in favor of the true model.

It turns out that the KL divergence is pretty damn useful for tons of practical stuff, so it’s a good thing to know. For instance, one can use the KL divergence to design the optimal experiment, in terms of having the most efficient accumulation of evidence. If design A has higher KL divergence than design B, then we expect to gain more evidence per observational unit using design A than design B.

I thought about writing a little primer on KL divergence, but I don’t know if the world needs another conceptual tutorial on this; as I already mentioned, there are some good ones out there already. However, there aren’t many resources online that walk you through how you might actually derive the KL divergence in practice (i.e., for non-toy distributions). Seriously, where are the worked examples? I really doubt anyone can get a feel for a concept like this without seeing a few cases worked out in detail.

At some point after searching for a while I got fed up, and I did what any other soon-to-be-card-carrying-statistician would do: I sat down and worked some examples out for myself. (Not gonna lie, I was pretty proud of myself for having the confidence to jump right into it. A few years ago I may have just given up after my failed google search. Anyway…)

For instance, what is the KL divergence between two normal distributions with different means but the same variance? If you google hard enough (or work it out yourself) you would find that in this case the KL divergence is the squared difference in means divided by twice the (common) variance. That is, the KL divergence is half of the squared standardized mean difference. Thus, the expected log likelihood ratio between a N(0,1) distribution and a N(2,1) distribution is (2-0)²/(2*1) = 2.

How precisely do we go from the definition of KL divergence to this result? This question started my KL quest, and that’s the point where my technical notes come in. The distributions covered in these notes include Bernoulli, Geometric, Poisson, Exponential, and Normal.

My technical notes are available via PsyArXiv. Any comments, feedback, or requests for other derivations are welcome.


Slides: “Bayesian Bias Correction: Critically evaluating sets of studies in the presence of publication bias”

I recently gave a lab presentation on the work we have been doing to attempt to mitigate the nefarious effects of publication bias, and I thought I’d share the slides here. The first iteration of the method (details given in Guan and Vandekerckhove, 2016), summarized in the first half of the slides, could be applied to single studies or to cases where a fixed effects meta-analysis would be appropriate. I have been working to extend the method to cases where one would perform a random-effects meta-analysis to account for heterogeneity in effects across studies, summarized in the second half of the slides. We’re working now to write this extension up and tidy up the code for dissemination.

Here are the slides (non-slideshare pdf available here):


New revision of How to become a Bayesian in eight easy steps

Quentin, Fabian, Peter, Beth and I recently resubmitted our manuscript titled “How to become a Bayesian in eight easy steps: An annotated reading list” that we initially submitted earlier this year. You can find an updated preprint here. The reviewer comments were pleasantly positive (and they only requested relatively minor changes), so I don’t expect we’ll have another revision. In the revised manuscript we include a little more discussion of the conceptual aspect of Bayes factors (in the summary of source 4), some new discussion on different Bayesian philosophies of how analysis should be done (in the introduction of the “Applied” section) and a few additions to the “Further reading” appendix, among other minor typographical corrections.

This was quite a minor revision. The largest change to the paper by far is our new short discussion on different Bayesian philosophies, which mainly revolve around the (ever-controversial!) issue of hypothesis testing. There is an understandable desire from users of statistics for a unitary set of rules and regulation–a simple list of procedures to follow–where if you do all the right steps you won’t piss off that scrupulous methods guy down the hall from you. Well, as it happens, statistics isn’t like that and you’ll never get that list. Statistics is not just a means to an end, as many substantive researchers tend to think, but an active scientific field itself. Statistics, like any field of study, is a human endeavor that has all sorts of debates and philosophical divides.

Rather than letting these divides turn you off from learning Bayes, I hope they prepare you for the vast analytic viewpoints you will likely encounter as Bayesian analyses become more mainstream. And who knows, maybe you’ll even feel inspired to approach your own substantive problems with a new frame of mind.  Here is an excerpt from our discussion:

Before moving on to our final four highlighted sources, it will be useful if readers consider some differences in perspective among practitioners of Bayesian statistics. The application of Bayesian methods is very much an active field of study, and as such, the literature contains a multitude of deep, important, and diverse viewpoints on how data analysis should be done, similar to the philosophical divides between Neyman–Pearson and Fisher concerning proper application of classical statistics (see Lehmann, 1993). The divide between subjective Bayesians, who elect to use priors informed by theory, and objective Bayesians, who instead prefer “uninformative” or default priors, has already been mentioned throughout the Theoretical sources section above.


A second division of note exists between Bayesians who see a place for hypothesis testing in science, and those who see statistical inference primarily as a problem of estimation. ….

You’ll have to check out the paper to see how the rest of this discussion goes (see page 10).   🙂

A quick comment on recent BF (vs p-value) error control blog posts

There have recently been two stimulating posts regarding error control for Bayes factors. (Stimulating enough to get me to write this, at least.) Daniel Lakens commented on how Bayes factors can vary across studies due to sampling error. Tim van der Zee compared the type 1 and type 2 error rates for using p-values versus using BFs. My comment is not so much to pass judgment on the content of the posts (other than this quick note that they are not really proper Bayesian simulations), but to suggest an easier way to do what they are already doing. They both use simulations to get their error rates (which can take ages when you have lots of groups), but in this post I’d like to show a way to find the exact same answers without simulation, by just thinking about the problem from a slightly different angle.

Lakens and van der Zee both set up their simulations as follows: For a two sample t-test, assume a true underlying population effect size (i.e., δ), a fixed sample size per group (n1 and n2),  and calculate a Bayes factor comparing a point null versus an alternative hypothesis that assigns δ a prior distribution of Cauchy(0, .707) [the default prior for the Bayesian t-test]. Then simulate a bunch of sample t-values from the underlying effect size, plug them into the BayesFactor R package, and see what proportion of BFs are above, below or between certain values (both happen to focus on 3 and 1/3). [This is a very common simulation setup that I see in many blogs these days.]

I’ll just use a couple of representative examples from van der Zee’s post to show how to do this. Let’s say n1 = n2 = 50 and we use the default Cauchy prior on the alternative. In this setup, one can very easily calculate the resulting BF for any observed t-value using the BayesFactor R package. A BF of 3 corresponds to an observed | t | = ~2.47; a BF of 1/3 corresponds to | t | = ~1. These are your critical t values. Any t value greater than 2.47 (or less than -2.47) will have a BF > 3. Any t value between -1 and 1 will have BF < 1/3. Any t value between 1 and 2.47 (or between -1 and -2.47) will have 1/3 < BF < 3. All we have to do now is find out what proportion of sample t values would fall in these regions for the chosen underlying effect size, which is done by finding the area of the sampling distribution between the various critical values.

easier type 1 errors

If the underlying effect size for the simulation is δ = 0 (i.e., the null hypothesis is true), then observed t-values will follow the typical central t-distribution. For 98 degrees of freedom, this looks like the following.


I have marked the critical t values for BF = 3 and BF = 1/3 found above. van der Zee denotes BF > 3 as type 1 errors when δ = 0. The type 1 error rate is found by calculating the area under this curve in the tails beyond | t | = 2.47. A simple line in r gives the answer:


The type 1 error rate is thus 1.52% (van der Zee’s simulations found 1.49%, see his third table). van der Zee notes that this is much lower than the type 1 error rate of 5% for the frequentist t test (the area in the tails beyond | t | = 1.98) because the t criterion is much higher for a Bayes factor of 3 than a p value of .05.  [As an aside, if one wanted the BF criterion corresponding to a type 1 error rate of 5%, it is BF > 1.18 in this case (i.e., this is the BF obtained from | t | = 1.98). That is, for this setup, 5% type 1 error rate is achieved nearly automatically.]

The rate at which t values fall between -2.47 and -1 and between 1 and 2.47 (i.e., find 1/3 < BF < 3) is the area of this curve between -2.47 and -1 plus the area between 1 and 2.47, found by:

[1] 0.3045337

The rate at which t values fall between -1 and 1 (i.e., find BF < 1/3) is the area between -1 and 1, found by:

[1] 0.6802267

easier type 2 errors

If the underlying effect size for the simulation is changed to δ  = .4 (another one of van der Zee’s examples, and now similar to Lakens’s example), the null hypothesis is then false and the relevant t distribution is no longer centered on zero (and is asymmetric). To find the new sampling distribution, called the noncentral t-distribution, we need to find the noncentrality parameter for the t-distribution that corresponds to δ = .4 when n1 = n2 = 50. For a two-sample t test, this is found by a simple formula, ncp = δ / √(1/n1 + 1/n2); in this case we have ncp = .4 / √(1/50 + 1/50) = 2. The noncentral t-distribution for δ=.4 and 98 degrees of freedom looks like the following.


I have again marked the relevant critical values. van der Zee denotes BF < 1/3 as type 2 errors when δ ≠ 0 (and Lakens is also interested in this area). The rate at which this occurs is once again the area under the curve between -1 and 1, found by:

[1] 0.1572583

The type 2 error rate is thus 15.7% (van der Zee’s simulation finds 16.8%, see his first table). The other rates of interest are similarly found.


You don’t necessarily need to simulate this stuff! You can save a lot of simulation time by working it out with a little arithmetic plus a few easy lines of code.