Statistical paradoxes and omnibus tests

I’ve been thinking about statistical paradoxes lately. They crop up all over the place, and they make for fun little puzzles. In this blog I’ll talk about two paradoxes that can occur when doing omnibus statistical tests. Relevant code is attached as a gist at the bottom of the page.

1) A common paradox

There’s a pretty common (and annoying!) statistical phenomenon that most people are probably familiar with when testing whether multiple means are equal to each other (or to some specific value). It is not uncommon to run an overall omnibus test and obtain a significant result, allowing us to reject the overall null hypothesis of equality. However, somewhat paradoxically(?), at the same time none of the follow-up tests on the individual group means/differences shows a significant effect (even without multiplicity correction). So, with mild frustration, we can claim that there’s some difference among the groups but we cannot pinpoint where.

This “paradox” of course is totally understandable. The overall omnibus test looks at all the group deviations at once, so it is possible for the model’s overall deviation to be large enough that we can reject the omnibus null, even if none of the groups show a particularly large deviation themselves. In other words, small deviations among the groups add up to a “large” overall deviation. The most obvious case is the cross-over interaction, with subgroup means showing a1 – a2 > 0 and b1 – b2 < 0. Because they go in opposite directions, the difference between (a1-a2) and (b1-b2) can be large while neither individual difference is itself very large.

After some careful consideration, I’d bet most people come to see this kind of omnibus test behavior to be perfectly reasonable, and not actually a paradox.

But it is simple enough to simulate, so let’s do that and try to “see” the paradox.

Consider for simplicity testing only two group means, where the omnibus null hypothesis is that they are both equal to zero. Assume also for simplicity that the data from the two groups have known standard deviations, so that we can safely use z-values (rather than t-values). If the group means are independent, then to generate their sampling distribution we can simply generate pairs of z-values from standard normal distributions with mean zero and standard deviation one.
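Here’s a minimal sketch of that simulation step in R (the full code for the figures is in the gist at the bottom; the seed and object names here are just for illustration):

# Simulate pairs of independent z-values under the omnibus null
set.seed(1)
k  <- 3e4                        # number of simulated pairs
z1 <- rnorm(k, mean = 0, sd = 1)
z2 <- rnorm(k, mean = 0, sd = 1)
Z  <- cbind(z1, z2)              # each row is one simulated pair (z1, z2)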

A large number of simulated pairs of z-values are shown below with dots. Because they are independent, the pairs scatter around the origin (0,0) in all directions equally, with most pairs relatively close to the origin (the red cross).

[Figure: simulated pairs of independent z-values (r = 0) scattered around the origin, with the 95% rejection circle, the grey box, and the example arrow]

Any pair of observable z-values lives somewhere on this plane, with their address given by their coordinates (z1, z2). A natural discrepancy measure in this scenario between the observed pair and the null hypothesis is the distance from the origin to the pair. This is merely the length of the line connecting (0,0) to the point (z1, z2) — i.e., Euclidean distance — which is the hypotenuse of a right triangle with base z1 and height z2. The Pythagorean theorem tells us this length is given by D=√(z1² + z2²).  Hence, we would reject the omnibus null only if the observed pair of z-values lives “far enough” away from the middle point of this cloud.

For example, the arrow starts at the origin and points to the pair (1.75, 1.85). The length of this arrow is √(1.75² + 1.85²) ≈ 2.55, i.e. the Euclidean distance from the null hypothesis (0,0) to the point (1.75, 1.85) is about 2.55.

If we call the sum of two squared independent z-values D² (i.e., the squared length from the origin to the pair), this D²-value follows a chi-square distribution with two degrees of freedom (with n tests it has df=n). The omnibus test is then significant only if D² is larger than the 95th percentile of the reference chi-square distribution, which is the value 5.991 for df=2. This region is shown with the circle. Observing any pairs of z-values inside the circle does not lead to rejection of the omnibus null, whereas observing pairs located outside the circle does lead to rejection. As we would expect, about 95% of the simulated pairs live inside the circle, and 5% live outside the circle.
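As a rough check, a few lines of R reproduce that 5% rejection rate under the null (just a sketch; the gist at the bottom computes the same statistic via a quadratic form, which also covers the correlated case later on):

# Omnibus rejection rate under the null, using the chi-square cutoff
set.seed(1)
z1 <- rnorm(3e4); z2 <- rnorm(3e4)   # independent z-values under the null
D2   <- z1^2 + z2^2                  # squared Euclidean distance from (0, 0)
crit <- qchisq(.95, df = 2)          # 5.991...
mean(D2 > crit)                      # proportion outside the circle, about .05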

The squared distance from the origin to the pair (1.75, 1.85) is 1.75² + 1.85² ≈ 6.49. This D² is larger than the critical value 5.991, so the pair (1.75, 1.85) would lead us to reject the overall null that both means are zero.

The grey square marks the area where neither individual test would reject its null, so z-pairs outside the box would result in at least one of the individual nulls being rejected. If we only looked at the z1 coordinate, we would reject the null for that test when z1 falls outside the vertical lines, and the same goes for z2 and the horizontal lines. The pair (1.75, 1.85) would therefore not lead to rejection of either test’s individual null hypothesis because it lives inside the box.

Thus, the zone where this “paradox” occurs is anywhere we can observe a pair of z-values that falls outside the circle but inside the box. That is, the paradox occurs in the four inside corners of the grey box. In this case with two groups, that area is quite small — it happens about .25% of the time if the omnibus null is true.
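A quick simulation gives a sense of how rare that corner region is (a sketch, not an exact calculation; the estimate wobbles a little from run to run):

# How often do we land outside the circle but inside the box?
set.seed(1)
z1 <- rnorm(1e6); z2 <- rnorm(1e6)
inside_box     <- abs(z1) < 1.96 & abs(z2) < 1.96
outside_circle <- z1^2 + z2^2 > qchisq(.95, df = 2)
mean(inside_box & outside_circle)    # on the order of .25% under the omnibus null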

Fun little side result: As you increase the number of independent tests (not just two tests but n → ∞ tests), the probability of landing in this “paradox” zone actually approaches 0%. Heuristically, we’d expect that with enough tests, at least one of them will eventually, by chance, get a z-value bigger than 1.96 in absolute value.

The proof is pretty easy. Consider that for any number of tests n the paradox zone is a subset of the inner box (i.e., the “n-cube” in n dimensions) that has sides of length 2*1.96. Then, the probability this paradox occurs is bounded from above by the probability that all n individual z-values are less than 1.96 in absolute value (i.e., that they live inside the n-cube). That probability is .95ⁿ when the omnibus null is true, which clearly goes to zero as n increases. Of course, that probability is just an upper bound; the actual probability of the paradox zone can get really small really fast!
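You can watch that upper bound shrink with a one-liner (this is just the bound, not the paradox probability itself):

0.95^c(2, 5, 10, 50)   # 0.9025 0.7738 0.5987 0.0769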

2) Rao’s little-known paradox

There is another paradox that can occur when doing omnibus tests, which is less widely known. But I think this one is much harder to resolve, even with drawing a picture! Rao’s paradox is basically the opposite of the previous paradox: The omnibus test of the overall null is non-significant, meaning we cannot reject the null hypothesis that all groups have zero mean (or equal means, etc.). But at the same time, all individual tests show a significant effect, so we can reject each group’s individual null hypothesis.

Kinda freaky right? (Maybe I should have posted this blog on halloween!…)

Rao’s paradox can occur when your tests are not independent. Imagine that you have participants come into the lab and you give them multiple tasks that measure the same general construct. Then it is not inconceivable that people high on the construct tend to score high on both tasks, and people low on the construct tend to score low on both tasks. This would naturally induce a positive correlation between the two sets of scores, and thus between their test statistics. Rao’s paradox could then occur, where you reject the null using task 1, reject the null using task 2, but fail to reject the joint null using an omnibus test.

Consider again the case of testing two group means, but now assume the two z-values are correlated at r=.5. Now we can generate pairs of z-values from a bivariate normal distribution with means of zero, standard deviations of 1, and a correlation of .5. The sampling distribution of these correlated pairs of z-values is shown below. Compared to the previous example, this sampling distribution is sort of slanted and elongated toward the upper-right and lower-left corners due to the correlation. I’ve also extended the edges of the grey box out a little bit, for a reason that will make sense soon.

[Figure: sampling distribution of correlated z-value pairs (r = .5), with the 95% ellipse and the extended grey box]
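Generating these correlated pairs is a small change to the earlier sketch, for example with MASS::mvrnorm (again just illustrative; the gist at the bottom builds the covariance matrix the same way for any number of tests):

# Correlated z-values from a bivariate normal with r = .5
library(MASS)
set.seed(1)
r     <- 0.5
Sigma <- matrix(c(1, r, r, 1), nrow = 2)   # unit variances, correlation r
Z     <- mvrnorm(3e4, mu = c(0, 0), Sigma = Sigma)
cor(Z[, 1], Z[, 2])                        # close to .5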

In this case the omnibus null hypothesis is rejected whenever a pair of z-values falls outside the ellipse. Our test statistic is still a function of the distance between the pair of z-values and the origin, but now we also need to account for where the pair lies relative to the general orientation of the sampling distribution. Clearly, some pairs of z-values that are close to the origin live outside the ellipse (in the northwest and southeast directions) and some that are far away live inside the ellipse (in the northeast and southwest directions).

Instead of the squared Euclidean distance that we used last time, now we will use the squared Mahalanobis distance as our test statistic. The Mahalanobis distance is essentially a generalization of Euclidean distance, to account for the direction and scale of the sampling distribution. In the case of two correlated z-tests, the squared Mahalanobis distance is D² = (1 – r²)⁻¹(z1² – 2rz1z2 + z2²), which once again follows a chi-square distribution with 2 degrees of freedom. So once again we reject the omnibus null if D² is larger than 5.991.

Take for example the observed pair (2.05, 2.05). When r=.5, this pair achieves a Mahalanobis distance of D² = 5.60, which is not larger than 5.991 and hence not significant. Thus, we would not reject the omnibus null. However, both z-values alone would normally be considered significant (two-tailed p = .04) and we would reject each individual null hypothesis.
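Here’s a small helper (my own, not from the gist) to check those numbers; the gist at the bottom does the same calculation with the inverse covariance matrix, which generalizes beyond two tests:

# Squared Mahalanobis distance for a pair of correlated z-values
maha2 <- function(z1, z2, r) (z1^2 - 2 * r * z1 * z2 + z2^2) / (1 - r^2)
maha2(2.05, 2.05, r = 0.5)            # about 5.60, below the cutoff of 5.991
2 * pnorm(2.05, lower.tail = FALSE)   # two-tailed p for each z alone, about .04

The same helper gives maha2(2.25, 2.25, r = 0.8) ≈ 5.63, which is the r = .8 example coming up below.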

From the picture, we see that any pair that lands outside the inner square but inside the ellipse will lead to this paradox. That is, the paradox occurs just beyond the upper-right and lower-left corners of the box.

As the correlation between tests gets larger, the sampling distribution gets stretched farther and farther in the diagonal direction. Thus, as r gets bigger we can observe larger and larger pairs of z-values (up to a limit) without rejecting the omnibus null. For instance, the sampling distribution for r=.8 is shown below. Even observing the pair (2.25, 2.25) would not reject the omnibus null, despite each individual test obtaining p=.024.

[Figure: sampling distribution of correlated z-value pairs (r = .8), with the 95% ellipse and the extended grey box]

To recap: Rao’s paradox can happen when you have correlated test statistics, which actually happens a lot. For example, consider a simple linear regression with an intercept and a slope parameter. If you do not center your predictor, then the sampling distributions of the intercept and slope estimates will very likely be correlated, possibly quite strongly!
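For instance, here is a quick sketch with made-up data showing how an uncentered predictor induces a strong correlation between the intercept and slope estimates:

# Uncentered predictor -> correlated intercept/slope sampling distribution
set.seed(1)
x   <- rnorm(100, mean = 10, sd = 2)     # predictor centered far from zero
y   <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
cov2cor(vcov(fit))["(Intercept)", "x"]   # strongly negative, close to -1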

Are there other cool paradoxes?

If you read this far and know of other statistical “paradoxes” that can happen with omnibus tests, I’d love to hear about them (via a comment, twitter, etc.). Also let me know if you would like to see more posts about statistical paradoxes in general, not just ones related to omnibus tests. They are pretty fun little puzzles!

 

Code for the figures:


library(MASS)
library(ellipse)

# Helper function to make colors more translucent
# from https://www.r-bloggers.com/how-to-change-the-alpha-value-of-colours-in-r/
add.alpha <- function(col, alpha = 1) {
  if (missing(col))
    stop("Please provide a vector of colours.")
  apply(sapply(col, col2rgb) / 255, 2,
        function(x)
          rgb(x[1], x[2], x[3], alpha = alpha))
}

# Colors I like
salmon <- add.alpha("salmon2", .35)
maroon <- add.alpha("maroon4", .8)
Grey   <- add.alpha("grey20", .4)

# Helper function to compute mahalanobis dist. for vector v & matrix A
# Reduces to Euclidean dist. when A is proportional to an identity matrix
quad.form <- function(v, A) {
  Q <- v %*% A %*% v
  return(Q)
}

# Now we can do our simulation
set.seed(1337)
r <- 0     # correlation between tests, 0 = indep.
k <- 3e4   # replicates
n <- 2     # number of tests
test.z <- c(1.75, 1.85)      # example z values for r = 0
# test.z <- c(2.05, 2.05)    # example z values for r = .5
# test.z <- c(2.25, 2.25)    # example z values for r = .8
mu <- rep(0, n)                                      # Mean vector
sigma <- diag(1 - r, n) + matrix(rep(r, n^2), n, n)  # Covariance matrix
tau <- solve(sigma)                                  # Invert cov. matrix
# tau <- 1/(1 - r^2) * matrix(c(1, -r, -r, 1), nrow = 2)  # Precise tau for 2 Z tests
Z <- mvrnorm(k, mu, sigma)              # generate our z-values
D2 <- apply(Z, 1, quad.form, A = tau)   # compute our omnibus test statistic
crit <- qchisq(.95, df = n)             # critical value of chi-square

# Plot the sampling distribution and add ellipse/circle and box and arrow
plot(Z[, 1], Z[, 2], xlim = c(-3.2, 3.2), ylim = c(-3.2, 3.2),
     col = salmon, pch = 20, bty = "n", xlab = "Z1", ylab = "Z2")
lines(ellipse(sigma, level = .95), col = maroon, lwd = 5, lty = 1)
lines(c(-1.96, -1.96, -1.96, 1.96, 1.96, 1.96, 1.96, -1.96),
      c(-1.96, 1.96, 1.96, 1.96, 1.96, -1.96, -1.96, -1.96), lwd = 3, col = Grey)
# abline(v = c(-1.96, 1.96), col = Grey, lwd = 3)  # Extend the lines of the box
# abline(h = c(-1.96, 1.96), col = Grey, lwd = 3)  # Extend the lines of the box
arrows(0, 0, test.z[1], test.z[2], col = "grey30", lwd = 5)
points(0, 0, pch = 3, col = "red", lwd = 3)


My friends did some cool science in 2018

I’ve had a bit of blogger’s block lately (read: the last two years). I have a tendency to write half of a blog post and then scrap it because I don’t think it’s that interesting. Well, my goal this year is to get over it. So to start I decided I want to highlight some of the very cool science my friends did in 2018. Honestly I feel like I don’t do enough to lift up my friends and celebrate their accomplishments, so here are three examples of papers my friends wrote last year that sparked a change in the way I think about one topic or another. I should say, if you aren’t keeping up with these early career researchers, you’re simply missing out on some great science. Happy 2019, everyone!

Improving psychological theory-testing via “systems of orders”

My friend Julia Haaf has been killing it lately (follow her on twitter and check out her google scholar). She just finished her PhD at Missouri and took up a really cool postdoc at the University of Amsterdam, and it feels like every other month she is posting a preprint of some awesome new paper. One of her papers I’d like to highlight is titled “A note on using systems of orders to capture theoretical constraint in psychological science” (co-authored with Fayette Klaassen and Jeff Rouder), which she presented at APS 2018 in our invited symposium, “Bayesian methods for the pragmatic psychologist.”

This paper really blew me away. One common theme in the ongoing reproducibility debate in psychology is that we need to improve the theory development of our field. The fact is, as psychologists we tend to describe our theories as a set of ordered relationships; e.g., that men respond slower than women in condition A but not condition B. I’m not sure if that will ever change. But when we go to test these directional theories, we usually do something super simple to account for our directional prediction, like a one-sided t-test. Julia and her coauthors describe this process as intellectually inefficient,  because “by positing [a] coarse verbal theory that provides for only modest constraints on the data, we are neither risking nor learning much from the data.” We can do better.

The paper goes on to present a framework that allows us to represent these types of theoretical predictions as sets of explicit order constraints between parameters in a statistical model. Moreover, they demonstrate with nice examples how a Bayesian approach to comparing competing psychological models allows for richer tests of theories in psychology. And come on, just look at this figure. It’s awesome.

[Figure from Haaf, Klaassen, and Rouder’s systems-of-orders paper]

Should we go to the LOO for model selection?

Another person killing it lately is my friend Quentin Gronau (he isn’t on twitter but check out his google scholar page full of interesting work!). Quentin is doing his PhD at the University of Amsterdam, and he blogs at http://www.bayesianspectacles.org from time to time. He and I are at the same stage of our PhD (third year) and we have a lot of overlapping research interests, so I’m always eager to read any paper he writes. One of Quentin’s papers that came out last year that I found quite thought-provoking was titled “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection” (co-authored with E.-J. Wagenmakers).

The main idea of this paper is to examine, in the simplest cases possible, the behavior of the Bayesian version of leave-one-out cross-validation (i.e., LOO) when used as a model comparison tool. It turns out that LOO does some weird stuff. For instance, consider comparing models of random guessing vs. informed responding (e.g., H0: θ=.5 vs. H1: θ≠.5) in some binary choice scenario. If we are in a situation where data come in pairs, and if it happens that every pair has 1 success and 1 failure, then we would ideally want our model comparison tool to give more and more evidence for the guessing model as pairs continue to come in. A split pair is, after all, perfectly in line with what the guessing model would predict will happen. If you use LOO for this model comparison, however, the evidence in favor of the guessing model can cap out at a relatively low amount, even after observing an infinite number of success-fail pairs.

Also, look at these pretty figures. So damn clean.

The conclusion of this paper was, basically, be careful with LOO if you use it as a model comparison tool; if it does weird stuff in super simple cases, then how can we be confident it’s doing something sensible in more complex cases? (The paper is more nuanced, of course.) This paper created quite a stir in some Bayesian circles, and prompted the journal that published it, Computational Brain and Behavior, to invite some very prominent researchers to write commentaries, which I also found quite thought-provoking (find them here, here, and here, and a rejoinder here). All in all, this paper and the commentaries made me think deeply about model comparison tools and what we should expect from them.

Correlation, causation, and DAGs, oh my!

Another person I want to include in this post is my other friend named Julia: Julia Rohrer (you’ll follow her on twitter and on google scholar if you know what’s good for you). (Both Julias also happen to be German! Julia was apparently the 36th most popular women’s name in Germany in 2017. Wait, Quentin is German too. Wow, the education system over there must be doing something right.)  Julia R. is also in her third year of her PhD –holla!– at the Max Planck Institute for the Life Course in Leipzig, where she is studying personality psychology. ALSO she is simultaneously(!) doing an undergraduate degree in computer science. Last year Julia published what I think is one of the best introductory tutorials out there on causal modeling, titled “Thinking clearly about correlations and causation: Graphical causal models for observational data.”

I think this paper should be required reading for anyone who wants to make causal statements but is limited to collecting observational data. A big challenge when working with observational data is that you can’t rule out confounding factors using randomization like you could in a controlled experiment. This paper outlines a way to model the relationships between variables of interest using what are called “Directed Acyclic Graphs” (i.e., DAGs) to get at the causal inferences we want to make in observational studies. If we create a set of boxes representing variables of interest and arrows that connect them, then if we follow certain rules, voilà, we have ourselves a DAG and maybe a chance at inferring causation. (There’s a bit more to it than just that, of course).

All the figures in this paper are box and arrow causal plots, so I’ll spare you copying them here. Instead I will share some section headers from this paper that I really enjoyed:

  • Confounding: The Bane of Observational Data
  • Learning to Let Go: When Statistical Control Hurts
  • Conclusion: Making Causal Inferences on the Basis of Correlational Data Is Very Hard

What more can I say? Go read these papers right now! You’ll be glad you did!

Some Technical Notes on Kullback-Leibler Divergence

TLDR: I typed up some of my technical notes where I derive the Kullback-Leibler divergence for some common distributions. Find them here on PsyArXiv.

The Kullback-Leibler (KL) divergence is a concept that arises pretty frequently across many different areas of statistics. I recently found myself needing to use the KL divergence for a particular Bayesian application, so I hit up google to find resources on it. The wikipedia page is not exactly … hmm how should I say this … friendly? Fortunately, there are a few nice tutorials online explaining the general concept, such as this, or this, or this (they are all nice but the statistician in me seems to prefer the third link).

Essentially, if we have two competing distributions/models that could have generated the data, the KL divergence gives us the expected log likelihood ratio in favor of the true distribution. (A refresher on likelihoods and likelihood ratios is here.) The log likelihood ratio can be interpreted as the amount of evidence the data provide for one model versus another, so the KL divergence tells us how much evidence we can expect our data to provide in favor of the true model.

It turns out that the KL divergence is pretty damn useful for tons of practical stuff, so it’s a good thing to know. For instance, one can use the KL divergence to design the optimal experiment, in terms of having the most efficient accumulation of evidence. If design A has higher KL divergence than design B, then we expect to gain more evidence per observational unit using design A than design B.

I thought about writing a little primer on KL divergence, but I don’t know if the world needs another conceptual tutorial on this; as I already mentioned, there are some good ones out there already. However, there aren’t many resources online that walk you through how you might actually derive the KL divergence in practice (i.e., for non-toy distributions). Seriously, where are the worked examples? I really doubt anyone can get a feel for a concept like this without seeing a few cases worked out in detail.

At some point after searching for a while I got fed up, and I did what any other soon-to-be-card-carrying-statistician would do: I sat down and worked some examples out for myself. (Not gonna lie, I was pretty proud of myself for having the confidence to jump right into it. A few years ago I may have just given up after my failed google search. Anyway…)

For instance, what is the KL divergence between two normal distributions with different means but the same variance? If you google hard enough (or work it out yourself) you would find that in this case the KL divergence is the squared difference in means divided by twice the (common) variance. That is, the KL divergence is half of the squared standardized mean difference. Thus, the expected log likelihood ratio between a N(0,1) distribution and a N(2,1) distribution is (2-0)²/(2*1) = 2.
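If you want to sanity-check that number numerically, a tiny Monte Carlo in R gets you there (a sketch of my own, not taken from the notes):

# Monte Carlo check of KL( N(2,1) || N(0,1) ) = (2 - 0)^2 / (2 * 1) = 2
set.seed(1)
x <- rnorm(1e6, mean = 2, sd = 1)   # draws from the "true" N(2,1) model
mean(dnorm(x, 2, 1, log = TRUE) - dnorm(x, 0, 1, log = TRUE))   # close to 2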

How precisely do we go from the definition of KL divergence to this result? This question started my KL quest, and that’s the point where my technical notes come in. The distributions covered in these notes include Bernoulli, Geometric, Poisson, Exponential, and Normal.

My technical notes are available via PsyArXiv. Any comments, feedback, or requests for other derivations are welcome.

 

Where have I been? (looking back on the last year)

With the new school year upon us, I figured this was a good time to reflect on all that happened with me over the past year or so. It was an exciting year, that included a lot of new collaborations and traveling all over the place. Let me bring you up to speed on what’s been going on.

Side note: If you know me primarily from reading this blog you might (justifiably) think I’ve disappeared. Only two posts since last summer? Geez, I really suck. I’m still here, but I just don’t have as much time as before to focus on the kind of in-depth technical blogging I used to do. On the bright side I have kept writing that kind of material, but in the form of papers! Maybe I can do some light less-technical blogging this year, we’ll see.

Here’s a list of some developments in my career and life over the last year (here’s an updated CV):

1. My collaborators and I published these papers

  • Introduction to the concept of likelihood and its applications [preprint] (which takes from some of my blog posts [1, 2])
  • How to become a Bayesian in eight easy steps: An annotated reading list [$$$, OA] (with Quentin Gronau, Fabian Dablander, Peter Edelsbrunner, and Beth Baribault)
  • Introduction to Bayesian inference for psychology [$$$, OA] (with Joachim Vandekerckhove)
  • J. B. S. Haldane’s contribution to the Bayes factor hypothesis test [OA] (with E.-J. Wagenmakers)
  • Making replication mainstream [preprint] (with Rolf Zwaan, Rich Lucas, and Brent Donnellan, born out of discussions from SIPS 2016)
  • Too true to be bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True [OA] (with Daniel Lakens)
  • Bayesian Inference for Psychology. Part II: Example Applications with JASP [OA] (with the JASP team)

2. And we’ve submitted some more

  • Bayesian Reanalyses from Summary Statistics: A Guide for Academic Consumers [preprint] (with Alexander Ly, Akash Raj, Maarten Marsman, Quentin Gronau, and E.-J. Wagenmakers)
  • Reported self-control does not meaningfully assess the ability to override impulses [preprint] (with Blair Saunders, Marina Milyavskaya, Daniel Randles, and Mickey Inzlicht)
  • Replication Bayes factors from Evidence Updating [preprint] (with Alexander Ly, Maarten Marsman, and E.-J. Wagenmakers)

3. J. P. de Ruiter and I started recording a podcast (with valuable help from Saul Albert and Laura de Ruiter and others)

  • We recorded a number of episodes and have hit some delays in production, but the podcast is coming soon! I will post about it when it is released.

4. I reviewed for four new journals

  • Advances in Methods and Practices in Psychological Science (a brand new psych journal!)
  • Journal of Experimental Psychology: General
  • Review of General Psychology
  • Social Psychological and Personality Science

5. I presented at 3 conferences

  • The Annual Meeting of the Society for Mathematical Psychology in Warwick, UK
  • The Annual Meeting of the Association for Psychological Science in Boston [poster]
  • The Annual Meeting of the Psychonomic Society in Boston

6. I taught 2 workshops on Bayesian statistics (and helped at a third)

  • A one-day workshop for the Psychology Statistics Club at University of Ottawa [materials]
  • A “deep dive” workshop for the Society for Personality and Social Psychology conference in San Antonio, TX
  • Teaching assistant for the Seventh Annual JAGS and WinBUGS Workshop in Amsterdam (I stuck around Amsterdam for ~2 months)

7. I attended the second annual meeting of The Society for the Improvement of Psychological Science (SIPS)

  • SIPS continues to be awesome
  • Our “Making replication mainstream” paper (just recently accepted!) was born out of discussions we had at the inaugural SIPS meeting

8. The JASP team and I made 3 new videos

9. I wrote a guest post for the new Bayesian Spectacles blog

10. I officiated a wedding

  • If you know me personally (in real life or on facebook) you might know that my sister got married earlier this year and I acted as the officiant for the ceremony. I’ve done a lot of academic public speaking, but this one was a special kind of nerve-wracking! (but also really fun and rewarding!)

 

(I don’t see these kinds of periodic recap posts from my blogging colleagues very often. I’m not sure why not. Maybe posts like this could feel like bragging about all the good stuff that’s happened to us, so it feels a little awkward. But even so, so what! Presumably someone reads and follows this blog because they want to know what I’m thinking and what I’m doing, and this kind of post is a good way to keep them updated.)

Slides: “Bayesian Bias Correction: Critically evaluating sets of studies in the presence of publication bias”

I recently gave a lab presentation on the work we have been doing to attempt to mitigate the nefarious effects of publication bias, and I thought I’d share the slides here. The first iteration of the method (details given in Guan and Vandekerckhove, 2016), summarized in the first half of the slides, could be applied to single studies or to cases where a fixed effects meta-analysis would be appropriate. I have been working to extend the method to cases where one would perform a random-effects meta-analysis to account for heterogeneity in effects across studies, summarized in the second half of the slides. We’re working now to write this extension up and tidy up the code for dissemination.

Here are the slides (pdf):