Statistical paradoxes and omnibus tests

I’ve been thinking about statistical paradoxes lately. They crop up all over the place, and they make for fun little puzzles. In this blog I’ll talk about two paradoxes that can occur when doing omnibus statistical tests. Relevant code is attached as a gist at the bottom of the page.

1) A common paradox

There’s a pretty common –and annoying!– statistical phenomenon that most people are probably familiar with when testing whether multiple means are equal to each other (or to some specific value). It is not uncommon that one runs an overall omnibus test and obtains a significant result, allowing for rejection of the overall null hypothesis of equality. However, somewhat paradoxically(?), at the same time none of the followup tests on the individual group means/differences show a significant effect (even without multiplicity correction). So, with mild frustration, we can claim that there’s some difference among the groups but we cannot pinpoint where.

This “paradox” of course is totally understandable. The overall omnibus test looks at all the group deviations at once, so it is possible for the model’s overall deviation to be large enough that we can reject the omnibus null, even if none of the groups show a particularly large deviation themselves. In other words, small deviations among the groups add up to a “large” overall deviation. The most obvious case is the cross-over interaction, with subgroup means showing a1 – a2 > 0 and b1 – b2 < 0. Because they go in opposite directions, the difference between (a1-a2) and (b1-b2) can be large while neither individual difference is itself very large.

After some careful consideration, I’d bet most people come to see this kind of omnibus test behavior to be perfectly reasonable, and not actually a paradox.

But it is simple enough to simulate, so let’s do that and try to “see” the paradox.

Consider for simplicity testing only two group means, where the omnibus null hypothesis is that they are both equal to zero. Assume also for simplicity that the data from the two groups have known standard deviations, so that we can safely use z-values (rather than t-values). If the group means are independent, then to generate their sampling distribution we can simply generate pairs of z-values from standard normal distributions with mean zero and standard deviation one.

A large number of simulated pairs of z-values are shown below with dots. Because they are independent, the pairs scatter around the origin (0,0) in all directions equally, with most pairs relatively close to the origin (the red cross).

rzero

Any pair of observable z-values lives somewhere on this plane, with their address given by their coordinates (z1, z2). A natural discrepancy measure in this scenario between the observed pair and the null hypothesis is the distance from the origin to the pair. This is merely the length of the line connecting (0,0) to the point (z1, z2) — i.e., Euclidean distance — which is the hypotenuse of a right triangle with base z1 and height z2. The Pythagorean theorem tells us this length is given by D=√(z1² + z2²).  Hence, we would reject the omnibus null only if the observed pair of z-values lives “far enough” away from the middle point of this cloud.

For example, the arrow starts at the origin and points to the pair (1.75, 1.85). The length of this arrow is √(1.75² + 1.85²) ≈ 2.55, i.e. the Euclidean distance from the null hypothesis (0,0) to the point (1.75, 1.85) is about 2.55.

If we call the sum of two squared independent z-values D² (i.e., the squared length from the origin to the pair), this D²-value follows a chi-square distribution with two degrees of freedom (with n tests it has df=n). The omnibus test is then significant only if D² is larger than the 95th percentile of the reference chi-square distribution, which is the value 5.991 for df=2. This region is shown with the circle. Observing any pairs of z-values inside the circle does not lead to rejection of the omnibus null, whereas observing pairs located outside the circle does lead to rejection. As we would expect, about 95% of the simulated pairs live inside the circle, and 5% live outside the circle.

The squared distance from the origin to the pair (1.75, 1.85) is 2.55², or roughly 6.45. This D² is larger than the critical value 5.991, so the pair (1.75, 1.85) would lead us to reject the overall null that both means are zero.

The grey square marks the area where neither individual test would reject its null, so z-pairs outside the box would result in at least one of the individual nulls being rejected. If we only looked at the z1 coordinate, we would reject the null for that test when z1 falls outside the vertical lines, and the same goes for z2 and the horizontal lines. The pair (1.75, 1.85) would therefore not lead to rejection of either test’s individual null hypothesis because it lives inside the box.

Thus, the zone where this “paradox” occurs is anywhere we can observe a pair of z-values that falls outside the circle but inside the box. That is, the paradox occurs in the four inside corners of the grey box. In this case with two groups, that area is quite small — it happens about .25% of the time if the omnibus null is true.

Fun little side result: As you increase the number of independent tests, so not just two tests but n -> infinity tests, the probability of observing all z-values in this “paradox” zone actually approaches 0%. Heuristically, we’d expect that with enough tests, at least one of them will eventually, by chance, get a z-value bigger than 1.96.

The proof is pretty easy. Consider that for any number of tests n the paradox zone is a subset of the inner box (i.e., the “n-cube” in n dimensions) that has sides of length 2*1.96. Then, the probability this paradox occurs is bounded from above by the probability that all n individual z-values are less than 1.96 in absolute value (i.e., that they live inside the n-cube). That probability is .95n when the omnibus null is true, which clearly goes to zero as n increases. Of course, that probability is just an upper bound; the actual probability of the paradox zone can get really small really fast!

2) Rao’s little-known paradox

There is another paradox that can occur when doing omnibus tests, which is less widely known. But I think this one is much harder to resolve, even with drawing a picture! Rao’s paradox is basically the opposite of the previous paradox: The omnibus test of the overall null is non-significant, meaning we cannot reject the null hypothesis that all groups have zero mean (or equal means, etc.). But at the same time, all individual tests show a significant effect, so we can reject each group’s individual null hypothesis.

Kinda freaky right? (Maybe I should have posted this blog on halloween!…)

Rao’s paradox can occur when your tests are not independent. Imagine that you have participants come into the lab and you give them multiple tasks that measure the same general construct. Then it is not inconceivable that people high on the construct tend to score high on both tasks, and people low on the construct tend to score low on both tasks. This would naturally induce a positive correlation between the two sets of scores, and thus between their test statistics. Rao’s paradox could then occur, where you reject the null using task 1, reject the null using task 2, but fail to reject the joint null using an omnibus test.

Consider again the case of two testing two group means, but now assume the two z-values are correlated at r=.5. Now we can generate pairs of z-values from a bivariate normal distribution with means of zero, standard deviations of 1, but correlation of .5. The sampling distribution of these correlated pairs of z-values is shown below. Compared to the previous example, this sampling distribution is sort of slanted and elongated at the upper-right and lower-left corners due to the correlation. I’ve also extended the edges of the grey box out a little bit for a reason that will make sense soon.

rfive

In this case the omnibus null hypothesis is rejected whenever a pair of z-values falls outside the ellipse.  Our test statistic is still a function of the distance between the pair of z-values and the origin, but now we also need to know which direction the z-pair lies in relation to the general orientation of the sampling distribution. Clearly, some pairs of z-values that are close to the origin live outside the ellipse (northwest and southeast directions) and some that are far away live inside the ellipse (northeast and southwest directions).

Instead of the squared Euclidean distance that we used last time, now we will use the squared Mahalanobis distance as our test statistic. The Mahalanobis distance is essentially a generalization of Euclidean distance, to account for the direction and scale of the sampling distribution. In the case of two correlated z-tests, the squared Mahalanobis distance is D² = (1-r²)-1(z1² – 2rz1z2 + z2²), which once again follows a chi-square distribution with 2 degrees of freedom. So once again we reject the omnibus null if D² is larger than 5.991.

Take for example the observe pair (2.05, 2.05). When r=.5, this pair achieves a Mahalanobis distance of D² = 5.60, which is not larger than 5.99 and hence not significant. Thus, we would not reject the omnibus null. However, both z-values alone would normally be considered significant (two-tailed p = .04) and we would reject each individual null hypothesis.

From the picture, we see that any pair that lands outside the inner square, but inside the ellipse, will lead to this paradox occurring. That is, the upper right or lower left outside corners of the box.

As the correlation between tests gets larger, the sampling distribution gets stretched farther and farther in the diagonal direction. Thus, as r gets bigger we can observe larger and larger pairs of z-values (up to a limit) without rejecting the omnibus null. For instance, the sampling distribution for r=.8 is shown below. Even observing the pair (2.25, 2.25) would not reject the omnibus null, despite each individual test obtaining p=.024.

reight

To recap: Rao’s paradox can happen when you have correlated test statistics, which actually can happen a lot. For example, consider a simple linear regression with an intercept and slope parameter. If you do not center your predictor then your sampling distribution of the intercept and slope will very likely be correlated, possibly quite highly!

Are there other cool paradoxes?

If you read this far and know of other statistical “paradoxes” that can happen with omnibus tests I’d love to hear about them (via a comment, twitter, etc.). Also let me know if you would like to see more blogs about other statistical paradoxes, not necessarily just ones related to omnibus tests but just other interesting paradoxes in general. They are pretty fun little puzzles!

 

Code for the figures:

A quick comment on recent BF (vs p-value) error control blog posts

There have recently been two stimulating posts regarding error control for Bayes factors. (Stimulating enough to get me to write this, at least.) Daniel Lakens commented on how Bayes factors can vary across studies due to sampling error. Tim van der Zee compared the type 1 and type 2 error rates for using p-values versus using BFs. My comment is not so much to pass judgment on the content of the posts (other than this quick note that they are not really proper Bayesian simulations), but to suggest an easier way to do what they are already doing. They both use simulations to get their error rates (which can take ages when you have lots of groups), but in this post I’d like to show a way to find the exact same answers without simulation, by just thinking about the problem from a slightly different angle.

Lakens and van der Zee both set up their simulations as follows: For a two sample t-test, assume a true underlying population effect size (i.e., δ), a fixed sample size per group (n1 and n2),  and calculate a Bayes factor comparing a point null versus an alternative hypothesis that assigns δ a prior distribution of Cauchy(0, .707) [the default prior for the Bayesian t-test]. Then simulate a bunch of sample t-values from the underlying effect size, plug them into the BayesFactor R package, and see what proportion of BFs are above, below or between certain values (both happen to focus on 3 and 1/3). [This is a very common simulation setup that I see in many blogs these days.]

I’ll just use a couple of representative examples from van der Zee’s post to show how to do this. Let’s say n1 = n2 = 50 and we use the default Cauchy prior on the alternative. In this setup, one can very easily calculate the resulting BF for any observed t-value using the BayesFactor R package. A BF of 3 corresponds to an observed | t | = ~2.47; a BF of 1/3 corresponds to | t | = ~1. These are your critical t values. Any t value greater than 2.47 (or less than -2.47) will have a BF > 3. Any t value between -1 and 1 will have BF < 1/3. Any t value between 1 and 2.47 (or between -1 and -2.47) will have 1/3 < BF < 3. All we have to do now is find out what proportion of sample t values would fall in these regions for the chosen underlying effect size, which is done by finding the area of the sampling distribution between the various critical values.

easier type 1 errors

If the underlying effect size for the simulation is δ = 0 (i.e., the null hypothesis is true), then observed t-values will follow the typical central t-distribution. For 98 degrees of freedom, this looks like the following.

Rplot05

I have marked the critical t values for BF = 3 and BF = 1/3 found above. van der Zee denotes BF > 3 as type 1 errors when δ = 0. The type 1 error rate is found by calculating the area under this curve in the tails beyond | t | = 2.47. A simple line in r gives the answer:

2*pt(-2.47,df=98)
[.0152]

The type 1 error rate is thus 1.52% (van der Zee’s simulations found 1.49%, see his third table). van der Zee notes that this is much lower than the type 1 error rate of 5% for the frequentist t test (the area in the tails beyond | t | = 1.98) because the t criterion is much higher for a Bayes factor of 3 than a p value of .05.  [As an aside, if one wanted the BF criterion corresponding to a type 1 error rate of 5%, it is BF > 1.18 in this case (i.e., this is the BF obtained from | t | = 1.98). That is, for this setup, 5% type 1 error rate is achieved nearly automatically.]

The rate at which t values fall between -2.47 and -1 and between 1 and 2.47 (i.e., find 1/3 < BF < 3) is the area of this curve between -2.47 and -1 plus the area between 1 and 2.47, found by:

2*(pt(-1,df=98)-pt(-2.47,df=98))
[1] 0.3045337

The rate at which t values fall between -1 and 1 (i.e., find BF < 1/3) is the area between -1 and 1, found by:

pt(1,df=98)-pt(-1,df=98)
[1] 0.6802267

easier type 2 errors

If the underlying effect size for the simulation is changed to δ  = .4 (another one of van der Zee’s examples, and now similar to Lakens’s example), the null hypothesis is then false and the relevant t distribution is no longer centered on zero (and is asymmetric). To find the new sampling distribution, called the noncentral t-distribution, we need to find the noncentrality parameter for the t-distribution that corresponds to δ = .4 when n1 = n2 = 50. For a two-sample t test, this is found by a simple formula, ncp = δ / √(1/n1 + 1/n2); in this case we have ncp = .4 / √(1/50 + 1/50) = 2. The noncentral t-distribution for δ=.4 and 98 degrees of freedom looks like the following.

deltazero

I have again marked the relevant critical values. van der Zee denotes BF < 1/3 as type 2 errors when δ ≠ 0 (and Lakens is also interested in this area). The rate at which this occurs is once again the area under the curve between -1 and 1, found by:

pt(1,df=98,ncp=2)-pt(-1,df=98,ncp=2)
[1] 0.1572583

The type 2 error rate is thus 15.7% (van der Zee’s simulation finds 16.8%, see his first table). The other rates of interest are similarly found.

Conclusion

You don’t necessarily need to simulate this stuff! You can save a lot of simulation time by working it out with a little arithmetic plus a few easy lines of code.

 

 

Type-S and Type-M errors

An anonymous reader of the blog emailed me:
 –
I wonder if you’d be ok to help me to understanding this Gelman’s  graphI struggle to understand what is the plotted distribution and the exact meaning of the red area. Of course I read the related article, but it doesn’t help me much.
Rather than write a long-winded email, I figured it will be easier to explain on the blog using some step by step illustrations. With the anonymous reader’s permission I am sharing the question and this explanation for all to read. The graph in question is reproduced below. I will walk through my explanation by building up to this plot piecewise with the information we have about the specific situation referenced in the related paper. The paper, written by Andrew Gelman and John Carlin, illustrates the concepts of Type-M errors and Type-S errors. From the paper:
We frame our calculations not in terms of Type 1 and Type 2 errors but rather Type S (sign) and Type M (magnitude) errors, which relate to the probability that claims with confidence have the wrong sign or are far in magnitude from underlying effect sizes (p. 2)
So Gelman’s graph is an attempt to illustrate these types of errors. I won’t go into the details of the paper since you can read it yourself! I was asked to explain this graph though, which isn’t in the paper, so we’ll go through step by step building our own type-s/m graph in order to build an understanding. The key idea is this: if the underlying true population mean is small and sampling error is large, then experiments that achieve statistical significance must have exaggerated effect sizes and are likely to have the wrong sign. The graph in question:
gelmanPlot
A few technical details: Here Gelman is plotting a sampling distribution for a hypothetical experiment. If one were to repeatedly take a sample from a population, then each sample mean would be different from the true population mean by some amount due to random variation. When we run an experiment, we essentially pick a sample mean from this distribution at random. Picking at random, sample means tend to be near the true mean of the population, and the how much these random sample means vary follows a curve like this. The height of the curve represents the relative frequency for a sample mean in a series of random picks. Obtaining sample means far away from the true mean is relatively rare since the height of the curve is much lower the farther out we go from the population mean. The red shaded areas indicate values of sample means that achieve statistical significance (i.e., exceed some critical value).
 –
The distribution’s form is determined by two parameters: a location parameter and a scale parameter. The location parameter is simply the mean of the distribution (μ), and the scale parameter is the standard deviation of the distribution (σ). In this graph, Gelman defines the true population mean to be 2 based on his experience in this research area; the standard deviation is equal to the sampling error (standard error) of our procedure, which in this case is approximately 8.1 (estimated from empirical data; for more information see the paper, p. 6). The extent of variation in sample means is determined by the amount of sampling error present in our experiment. If measurements are noisy, or if the sample is small, or both, then sampling error goes up. This is reflected in a wider sampling distribution. If we can refine our measurements, or increase our sample size, then sampling error goes down and we see a narrower sampling distribution (smaller value of σ).

Let’s build our own Type-S and Type-M graph

In Gelman’s graph the mean of the population is 2, and this is indicated by the vertical blue line at the peak of the curve. Again, this hypothetical true value is determined by Gelman’s experience with the topic area. The null hypothesis states that the true mean of the population is zero, and this is indicated by the red vertical line. The hypothetical sample mean from Gelman’s paper is 17, which I’ve added as a small grey diamond near the x-axis. R code to make all figures is provided at the end of this post (except the gif).
first_plot
If we assume that the true population mean is actually zero (indicated by the red vertical line), instead of 2, then the sampling distribution has a location parameter of 0 and a scale parameter of 8.1. This distribution is shown below. The diamond representing our sample mean corresponds to a fairly low height on the curve, indicating that it is relatively rare to obtain such a result under this sampling distribution.
null_plot
Next we need to define cutoffs for statistically significant effects (the red shaded areas under the curve in Gelman’s plot) using the null value combined with the sampling error of our procedure. Since this is a two-sided test using an alpha of 5%, we have one cutoff for significance at approximately -15.9 (i.e., 0 – [1.96 x 8.1]) and the other cutoff at approximately 15.9 (i.e., 0 + [1.96 x 8.1]). Under the null sampling distribution, the shaded areas are symmetrical. If we obtain a sample mean that lies beyond these cutoffs we declare our result statistically significant by conventional standards. As you can see, the diamond representing our sample mean of 17 is just beyond this cutoff and thus achieves statistical significance.
third_plot
But Gelman’s graph assumes the population mean is actually 2, not zero. This is important because we can’t actually have a sign error or a magnitude error if there isn’t a true sign or magnitude. We can adjust the curve so that the peak is above 2 by shifting it over slightly to the right. The shaded areas begin in the same place on the x-axis as before (+/- 15.9), but notice that they have become asymmetrical. This is due to the fact that we shifted the entire distribution slightly to the right, shrinking the left shaded area and expanding the right shaded area.
fourth_plot
And there we have our own beautiful type-s and type-m graph. Since the true population mean is small and positive, any sample mean falling in the left tail has the wrong sign and vastly overestimates the population mean (-15.9 vs. 2). Any sample mean falling in the right tail has the correct sign, but again vastly overestimates the population mean (15.9 vs. 2). Our sample mean falls squarely in the right shaded tail. Since the standard error of this procedure (8.1) is much larger than the true population mean (2), any statistically significant result must have a sample mean that is much larger in magnitude than the true population mean, and is quite likely to have the wrong sign.
In this case the left tail contains 24% of the total shaded area under the curve, so in repeated sampling a full 24% of significant results will be in the wrong tail (and thus be a sign error). If the true population mean were still positive but larger in magnitude then the shaded area in the left tail would become smaller and smaller, as it did when we shifted the true population mean from zero to 2, and thus sign errors would be less of a problem. As Gelman and Carlin summarize,
setting the true effect size to 2% and the standard error of measurement to 8.1%, the power comes out to 0.06, the Type S error probability is 24%, and the expected exaggeration factor is 9.7. Thus, it is quite likely that a study designed in this way would lead to an estimate that is in the wrong direction, and if “significant,” it is likely to be a huge overestimate of the pattern in the population. (p. 6)
I hope I’ve explained this clearly enough for you, anonymous reader (and other readers, of course). Leave a comment below or tweet/email me if anything is unclear!
Here is a neat gif showing our progression! Thanks for reading 🙂
 plots_gif
 (I don’t think this disclaimer is needed but here it goes: I don’t think people should actually use repeated-sampling statistical inference. This is simply an explanation of the concept. Be a Bayesian!)

R code