I had a discussion about the value of evidence *for* a null hypothesis and whether or not testing (or even considering) a null model makes sense. I think it does, and I don’t think the value depends in any way on whether nulls can ever be precisely true. All models are approximations, and all models break down somewhere as long as you keep digging. That conversation evolved and took many turns with many contributors. We went to crud factors, inference from CIs vs inference from BFs, to birth order effects, to wishing each other a happy new year. It was lovely. I love my tweeps.
I read a wonderful article today by EJ Wagenmakers and colleagues (Krypotos, Criss, & Iverson) detailing the effects of nonlinear transformations on different types of interactions (ordinal, crossover, etc). I wish they taught this in my stats classes! If your interaction doesn’t cross over, you can never be too sure about it, because your latent variable could scale in a nonlinear way and produce the appearance of an interaction that is simply a function of that scaling. Change the scaling, lose the interaction. Great paper, great illustrations, really expanded my mind.
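Here is a minimal sketch of that point, using made-up 2x2 cell means (they are not from the paper) and a log transform as the assumed nonlinear rescaling. The helper name `interaction_contrast` is just for illustration.

```python
import math

def interaction_contrast(cell_means):
    """Difference of differences for a 2x2 table of cell means."""
    (a1b1, a1b2), (a2b1, a2b2) = cell_means
    return (a2b2 - a2b1) - (a1b2 - a1b1)

def log_scale(cell_means):
    """Apply a monotone nonlinear transform (natural log) to every cell."""
    return tuple(tuple(math.log(m) for m in row) for row in cell_means)

# Hypothetical ordinal (non-crossover) pattern: the effect of B is larger at A2.
ordinal = ((2.0, 4.0), (4.0, 8.0))
# Hypothetical crossover pattern: the effect of B reverses sign at A2.
crossover = ((2.0, 4.0), (4.0, 2.0))

ordinal_raw = interaction_contrast(ordinal)
ordinal_log = interaction_contrast(log_scale(ordinal))
crossover_raw = interaction_contrast(crossover)
crossover_log = interaction_contrast(log_scale(crossover))

print(ordinal_raw, ordinal_log)
print(crossover_raw, crossover_log)
```

With these numbers the ordinal interaction is 2 on the raw scale and essentially zero on the log scale, while the crossover contrast stays negative on both. A monotone transform cannot undo a reversal in the ordering of the cells.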
Then I read a blog post on the Psychonomic Society website by Stephan Lewandowsky about why experimental power is purely a pre-experimental concept. Very reasonable write-up, explaining how comparative likelihoods condition on only the data obtained, a key feature of Bayesian inference. I didn’t have any problems with the write-up until it got to the end and said, “To draw powerful inferences without being distracted by the power (or lack thereof) of an experiment, all we need to do is to abandon the classic frequentist approach in favour of Bayesian statistics.” Now that’s just silly. As I said in my tweet about the article, Bayesians don’t have a monopoly on likelihoods. You can use likelihoods in many different ways! And if the reader took the likelihood message from the post to heart and simply supplemented their current work with it, we’d be a little better off. Now that’s not to say I disagree with the statement as it’s written. But simply agreeing that likelihoods are valuable and correct ways to think about results doesn’t imply one is of the Bayesian persuasion.
The blog sparked a lot of discussion on Twitter, and led us down the typical rabbit hole that discussions of Bayesian statistics lend themselves to. We got into questions such as: How do Bayesians calculate power and plan sample sizes? Do BF-based stopping rules bias parameter estimates? (They do.) Bias vs coherence? Should we use posterior distribution characteristics for our stopping rules? What effect sizes should we use for power analyses? Does sequential analysis solve the issue?
I read about Bayesian power analysis from some of Kruschke’s book (the one with puppies on the cover) and one of his papers. You may think from discussions on Twitter that Bayesians have no interest in long-run frequencies of any type. That’s not true; they’re just interested in them for different reasons, and the distinction comes when it’s time to actually do the inference. Power analysis is good for exactly that: planning. However, that’s where it stops. Inferences about the data should not be influenced by anything that hasn’t actually happened. If my power analysis says I only have a 5% chance of obtaining a 95% posterior interval of width X with N1=N2=25, and it ends up I do in fact get a width of X with those sample sizes, the fact that this is a relatively rare occurrence relative to other possibly obtained data has no bearing on the inferences I draw about the actually obtained data. This is the likelihood principle in action.
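To make the planning side concrete, here is a rough simulation sketch. It is not Kruschke's procedure. I am assuming two normal groups with a common unknown variance and the standard reference prior, under which the posterior for the mean difference is a shifted, scaled t with 48 degrees of freedom, so the 95% interval width depends on the data only through the pooled SD. The true effect, SD, target width, and the hard-coded t quantile (about 2.0106 for 48 df) are all assumptions chosen for illustration.

```python
import random
import statistics

N_PER_GROUP = 25      # N1 = N2 = 25, as in the text
T_CRIT_48 = 2.0106    # assumed: approx. 0.975 quantile of Student's t, 48 df

def interval_width(group1, group2):
    """Width of the 95% posterior interval for the mean difference
    under the assumed reference prior (pooled-variance t posterior)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    scale = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return 2 * T_CRIT_48 * scale

def prob_width_at_most(target_width, effect=0.5, sd=1.0, n_sims=2000, seed=1):
    """Pre-data probability that the interval is no wider than target_width,
    estimated by simulating hypothetical experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        g1 = [rng.gauss(0.0, sd) for _ in range(N_PER_GROUP)]
        g2 = [rng.gauss(effect, sd) for _ in range(N_PER_GROUP)]
        if interval_width(g1, g2) <= target_width:
            hits += 1
    return hits / n_sims

planning_prob = prob_width_at_most(0.94)  # hypothetical target width X
print(planning_prob)
```

The number this prints is a statement about experiments I might run. Once the data are in, the interval I actually got is the interval I interpret, however improbable its width looked beforehand.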
Reading Jimmie Savage’s paper, “Implications of Personal Probability for Induction,” published in The Journal of Philosophy. He tries to answer the question: What rational basis is there for any of our beliefs about the unobserved? He answers with examples of personal probability: “My opinions today are the rational consequence of what I have seen since yesterday.” It has shown its age a bit in writing style (or maybe that’s just a philosophy journal for ya) but the argument is focused and clear. Savage uses examples of balls in a box and male/female birth ratios to explicate how our beliefs might rationally change after witnessing, say, 20 black balls drawn in succession from the box, among other examples.
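A toy version of the balls-in-a-box update, not taken from Savage's paper: I am assuming a uniform Beta(1, 1) prior on the proportion of black balls and draws that behave like independent trials (a very large box, or drawing with replacement). Under those assumptions the predictive probability of another black ball follows Laplace's rule of succession.

```python
from fractions import Fraction

def predictive_prob_black(black_seen, other_seen, prior_a=1, prior_b=1):
    """Posterior predictive probability that the next ball is black,
    given a Beta(prior_a, prior_b) prior on the proportion of black balls."""
    return Fraction(prior_a + black_seen,
                    prior_a + prior_b + black_seen + other_seen)

before = predictive_prob_black(0, 0)    # belief before any draws
after = predictive_prob_black(20, 0)    # belief after 20 black balls in a row

print(before, after, float(after))
```

Under the uniform prior, belief in "the next ball is black" moves from 1/2 to 21/22. A different prior would give a different number, which is rather the point of personal probability.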
A new paper came out in Psych Science called “Growth and change in attention problems, disruptive behavior, and achievement from kindergarten to fifth grade,” by Amy Claessens and Chantelle Dowsett. They tracked performance of elementary school children and tried to see how changes in severity of attention problems and disruptive behavior impacted children’s test scores in fifth grade. The results suggest that as attention problems worsen, test scores drop (both math and reading), so attention problems appear to have a meaningful negative effect on children’s test performance. Makes sense. Changes in disruptive behavior did not appear to meaningfully impact later test scores, and these behavior problems seemed relatively innocuous compared to attention problems. Interesting stuff.
Collected some data today. I wonder if male/female experimenter has any effect on ES estimates from developmental studies. We both build rapport with the kids before the study starts but many kids are much more comfortable working with women researchers. I wonder if that has been studied.
Interesting discussion on Andrew Gelman’s blog today on how we should revise our estimates for an effect after a replication. If it is pre-registered, should we take the replication ES as our best estimate? Or do we average them? Clearly if the first result is not pre-registered then the effect size will be biased, simply due to the current publishing norms. Even the best researchers cannot avoid this; it is simply a characteristic of the system. Pre-registering eliminates most bias in ES estimates and so clearly pre-registered is better. I think we should only average uncorrected ES estimates if they are all pre-registered and seemingly measuring the same population.
Watching the Carolina vs Seattle playoff game and they show the stats of the team with a player (6 games) and without him (10 games). They show 4 stats in per game format (205 vs 150 yards /game, etc.) and 1 raw stat (13 vs 29 sacks). Surely they should have shown all of them as per game stats, since they have 6 and 10 with/without him. If they had then they’d show 2.2 vs 2.9 sacks /game. Doesn’t look quite as impressive as 13 vs 29!
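The arithmetic, spelled out as a small sketch using only the numbers quoted above (the function name is mine):

```python
def per_game(total, games):
    """Convert a raw season total into a per-game rate."""
    return total / games

sacks_with = per_game(13, 6)       # 6 games with the player
sacks_without = per_game(29, 10)   # 10 games without him

raw_ratio = 29 / 13                          # how the raw totals compare
per_game_ratio = sacks_without / sacks_with  # how the rates compare

print(round(sacks_with, 1), round(sacks_without, 1))
print(round(raw_ratio, 2), round(per_game_ratio, 2))
```

The raw totals make it look like more than twice as many sacks without him. On a per-game basis the increase is closer to a third.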
Reading Berger’s 2003 paper about integrating all of the different testing paradigms (Fisher, Neyman-Pearson, Bayes) and it just seems like such a reach. He assumes that the Bayesian answer is correct (which it is :p ) and then tries to mold the other tests to fit it. P-values will never be posterior probabilities, no matter how much calibrating you do. NP tests are always bunk, just let it go already. You were very convincing back in the 80s, so convincing that it seems to others that you have gotten confused and think now that Fisher and NP were on to something. Or maybe he was just trying to appease people.
An old blog post by WM Briggs (here) was sent my way today, and it does a pretty solid job of showing why p is meaningless as an index of evidence for or against the null hypothesis. He has a great take on Fisher’s disjunction (given we see this low p value, either the null is false or a rare event has occurred) and by restating it he really gets the point across: “Either the null hypothesis is false and we see a small p-value, or the null hypothesis is true and we see a small p-value.” That is what Fisher implies, because the disjunction is conditional on seeing a low p value. But then he explains that the previous sentence reduces to, “Either the null hypothesis is true or it is false, and we see a small p-value.” But that first part is a tautology. Of course it is either true or false (given we are in frequency land where parameters and truths of hypotheses are fixed), it has to be one or the other. So we can simply leave it out. And so we are left with, “We see a small p value.” What do we do with just that bit? Nothing, really.
Relearning calculus, man it’s been a long time. But I’ve got to be able to do it if I want to get into a real quant program! I didn’t have a good appreciation for how great calculus really is when I was learning it in school, but relearning it on my own has been fabulous. The fundamentals are really fascinating, and I am excited to see how my stats understanding expands as I dust off my math skills.
More calculus. More learning. More fun. Hopefully I’ll be done with this soon and can move on to linear algebra review. I’ve got to get this down first though! Luckily I learned it all before so I’m mostly just massaging my brain back into action. Lots of, “oh right,” and “ah I remember that” moments.
Thinking about posting a new blog entry and cross posting to the winnower. I think I’ll write one on subjectivity in science, and how we should embrace it.
Reading this review of the disgust and moral judgment literature by Justin Landy and Geoffrey Goodwin that’s in press at PoPS. A very telling funnel plot is on pg 53. Unfortunately, the paper uses trim-and-fill (eww) to (try to) correct for the bias. We all know trim-and-fill sucks, get with the times. And it has this awful sentence: “The two moderators themselves were unrelated to one another, χ2 (3) = .90, p = .83, which provides reassurance that any observed effects are not attributable to a lack of independence among them”. No. Please. Mercy. You can only commit so many fallacies at once, slow down.
And also this comment by Deborah Mayo on Andrew Gelman’s blog is especially great, I hope she wouldn’t mind me copying it here. Just fantastic:
If one is studying whether “situated cognition” of cleanliness may influence moral judgments, one should demonstrate a rigorous and stringent critique of each aspect of the artificial experiment and proxy variables before giving us things like: unscrambling words having to do with soap causes less harsh judgments of questionable acts such as eating your dog after he’s been run over, as measured by so and so’s psych scale. Serious sciences or even part way serious inquiries demand some kind of independent checks of measurements, not just reification based on another study within the same paradigm (assuming the reality of the phenomenon). When I see absolutely no such self scrutiny in these studies, but rather the opposite: the constant presumption that observations are caused by the “treatment”, accompanied by story-telling, for which there is huge latitude, I’m inclined to place the study under the umbrella of non-science/chump effects before even looking at the statistics.
Took a break for MLK holiday. Had to recharge the batteries.
Reading some recent work from Charles Judd, Jake Westfall, and David Kenny on treating stimuli as random effects (link). I think this is an extremely interesting and important subject. I wonder whether, once I get an intuitive hold on this, I’ll start seeing random effects treated as fixed effects everywhere. Time will tell.
Starting to read another paper about how conclusions change when you look through the Bayesian lens. This time by Joe Hilgard and colleagues (Engelhardt, Bartholow, & Rouder), and you can read the preprint here. I think they give a wonderful, brief introduction to why Bayes is beautiful and then show how it can substantially alter what we infer from our data.
Continuing to read Joe’s paper. I’m not familiar with the video game literature very well, but I like the bayes flavor he has brought to it.
Rechecked out a few books I had to return to the library, but forgot to pick up Lindley’s “Making Decisions.” D’oh! I’ll get it next time.
Read this (warning: link downloads a PDF automatically from Frontiers) interesting paper from Jean-Michel Hupé titled, “Statistical inferences under the Null hypothesis: Common mistakes and pitfalls in neuroimaging studies”. Interesting paper, introduced me to the concept of Type III errors. That is, rejecting the null-hypothesis but for the wrong reason (confounding factors, motivation effects, etc.). Worth a read if you’re doing MRI research!
As I’ve said before, dumping on NHST and CIs is a guaranteed way to start a Twitter discussion. And so I did it again. It really is just the best way to pass the time. Unfortunately twitter isn’t very good about retaining extremely long tweet chains, so much of the early conversation is gone.
Also finished reading a few articles I started before that I got side-tracked on. Westfall’s papers on random effect stimuli and Hilgard’s new preprint.
Another good twitter discussion. Should we worship methods of inference? [1/31 update: this thread is still going. I think it has over 450 replies. Pretty insane!]
Also, Daniel Lakens wrote an interesting post about different t-tests (Welch vs Student) and when to use them. Interesting.
Starting to wonder now about what will be the best way to prep for a quant psych phd program. More math (calculus, linear algebra), or more R? I feel like I really am getting more insight into stats when I up my math game, but the utility of R and getting on that early just seems to be very beneficial. Maybe I’ll ask the tweeps.
Wondering about the coherence of ROPE testing, as recommended by Kruschke (you can find an example article here). Something just feels off about it. When I calculate my posterior, and 80%, say, of my HDI is within my ROPE, I wouldn’t be able to “accept” the ROPE’d value (e.g., a null value) even though most of the posterior mass is centered in the ROPE. With a relatively diffuse prior, I could be looking at a pretty strong Bayes Factor (the ratio of posterior density at zero to prior density at zero) without my ROPE procedure supporting this null. I’ll have to do some more reading and try to find a principled criticism.
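A quick numerical sketch of the tension, with every number made up for illustration: a diffuse Normal(0, 10) prior on the effect, an observed estimate of 0.05 with standard error 0.10 treated as a normal likelihood, and a ROPE of ±0.15. The Bayes factor is the density ratio at zero described above (the Savage-Dickey ratio), and the ROPE decision uses the rule that the whole 95% HDI must fall inside the ROPE to accept.

```python
from statistics import NormalDist

# Assumed prior and data summary (hypothetical values).
prior = NormalDist(mu=0.0, sigma=10.0)
estimate, std_error = 0.05, 0.10
rope = (-0.15, 0.15)

# Conjugate normal-normal update for the effect.
prior_precision = 1 / prior.variance
data_precision = 1 / std_error ** 2
post_var = 1 / (prior_precision + data_precision)
post_mean = post_var * (prior_precision * prior.mean + data_precision * estimate)
posterior = NormalDist(post_mean, post_var ** 0.5)

# Bayes factor for the point null: posterior density at 0 over prior density at 0.
bf01 = posterior.pdf(0.0) / prior.pdf(0.0)

# For a normal posterior the 95% HDI equals the central 95% interval.
hdi = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))
mass_in_rope = posterior.cdf(rope[1]) - posterior.cdf(rope[0])
hdi_inside_rope = rope[0] <= hdi[0] and hdi[1] <= rope[1]

print(round(bf01, 1), round(mass_in_rope, 3), hdi, hdi_inside_rope)
```

With these assumed numbers the Bayes factor favors the null by roughly 88 to 1 and about 82% of the posterior mass sits inside the ROPE, yet the upper end of the HDI pokes outside it, so the ROPE rule withholds judgment. Much of that Bayes factor comes from how diffuse the prior is, which is part of what makes the comparison slippery.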
More thinking about rope testing. Also- leading discussion at the area meeting on the New Statistics paper from Psych Science tomorrow. I’m not sure how it’ll go.
So the area meeting was good. We spent most of the time talking about research integrity and very little time talking about the actual ideas behind confidence intervals. Oh well, hopefully people got something out of it.
A very funny part from Gerd Gigerenzer and Julian Marewski’s paper where they talk about researchers reporting a chi square test that had 2347 yes against 58 no. Just silly.
Also talked to the tweeps today about visualizing bayes factors. Felix Schönbrodt wrote a really cool post trying to show what the data might look like that correspond to different bayes factors. Now, I get that lots of math is understood by visualization. I think trying to understand what bayes factors mean by visualization is a good idea and very helpful. But trying to see a bayes factor from raw data is just an act of futility. I think even the best bayesians would have a hard time just looking at raw data and knowing the bayes factor. That’s why people write scripts to calculate the damn thing! Felix came to much the same conclusion. His post has some wonderful shiny apps that you can play with though!