In about a week I’ll be leading a journal club discussion on this paper, “The New Statistics: Why and How”. I think it would behoove me to do a quick run through the paper before we get to the seminar table so I don’t get lost.
The main focus points from the paper:
1. We should promote research integrity by addressing three problems.
First, we should make sure all research results are published (if not in a big-name journal, then at least in an online repository). If only big, exciting findings make it into journals, we’ll have a biased set of results that leaves us misinformed. Second, we need to avoid bias in data selection. Namely, we need to flag which results we predicted ahead of time and which we found after peeking at the data. This is a problem because the distinction between prespecified and post-hoc analyses often isn’t made, allowing researchers to report results that might simply be lucky patterns. Third, we should do more replication and always report it. If “failed” replications never get reported, it seems reasonable to assume that some of the published literature has overestimated the size and reliability of its results. If we make a go at the same procedure and find much smaller (or larger!) effects, then by reporting the results we paint a more realistic picture of the findings.
2. We should switch our thinking from, “I must find statistically significant results,” to “I should try to be as precise in my estimate as possible.”
The best way, says Cumming, is to move entirely away from the thinking involved in trying to reject a null hypothesis (typically the opposite of what we really want to claim) that we never actually believed in the first place and that is most certainly known to be false from the outset. For example, if we want to show that men have higher levels of testosterone than women and find men avg. 80mg vs. women avg. 50mg in a blood sample, we wouldn’t test the likelihood of our actual hypothesis. We would first need to set up a hypothesis we want to disprove (that men and women do not differ in testosterone levels), then calculate the chance of finding data as extreme as, or more extreme than, what we found. In other words, we instead have to ask, “What is the chance that we would find a result at least as extreme as ours if we assume they actually don’t differ at all?” Kinda hard to wrap your head around, right?
That’s what a p-value describes. So if we find there is only a 1% chance of finding data at least as extreme as ours in a hypothetical world where there is no real difference, then we say “our result is statistically significant, p < .05, so we reject the null hypothesis that men and women have equal testosterone levels.” Note that this doesn’t actually tell us anything about the likelihood of our hypothesis, namely that men have higher levels. It only tells us the likelihood of finding data like ours if we assume that there is no difference between men and women. It also doesn’t say anything about how big the difference between men and women is. This method is confusing because it relies on calculations that take into account things we don’t actually observe or believe.
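To make that “hypothetical world with no difference” a bit more concrete, here is a minimal sketch of a permutation test in plain Python. It is not from Cumming’s paper; it is just one simple way to compute a p-value. The testosterone numbers in `MEN` and `WOMEN` are made up for illustration (roughly matching the 80 vs. 50 example above), and the function name `permutation_p_value` is my own. The idea: if the group labels really don’t matter, shuffling them should produce differences as big as the observed one fairly often.

```python
import random

# Hypothetical blood-sample values (mg), invented purely for illustration.
MEN = [78, 85, 80, 74, 83, 79, 82, 77]
WOMEN = [52, 48, 50, 47, 55, 51, 49, 53]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Estimates how often a label-shuffled data set shows a difference
    at least as extreme as the observed one, i.e. P(data | no difference).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # pretend the group labels are arbitrary
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    p = permutation_p_value(MEN, WOMEN)
    print(f"observed difference: {mean(MEN) - mean(WOMEN):.1f} mg, p = {p:.4f}")
```

Notice what the function never does: it never computes the probability that men really have higher levels, and the p-value alone never tells you the size of the difference. You have to print that separately.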
Cumming’s answer is to do away with p and simply report how big our effects are, then convey how precise our estimates are in the form of a confidence interval, usually set at 95%. Back to testosterone: if we found that men had 30 +/- 5 mg higher testosterone than women, then that statement conveys both the size of our effect (30 mg) and the amount of uncertainty we have about the estimate (it could be off by 5 mg in either direction). Cumming thinks that this method is superior in every way to the significance tests that are so popular, because it reports more information in an easier-to-digest format. It also lends itself to more reliable estimates of how our results would turn out if we conducted the experiment again. If an original study found p = .05, the p-value of a replication can range from <.001 to about .40. That’s a huge range! You could get lucky and find very strong evidence, or you could be unlucky and never have a chance in hell of getting results published. Alternatively, if the original report estimated an effect of 30 +/- 5 (as a 95% confidence interval), then there is an 83% chance that a replication study will find a value between 25 and 35. That does seem a lot more informative than p.
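The 83% figure can be checked with a quick sketch, under some simplifying assumptions of my own: both studies have the same sample size, sample means are normally distributed, and the standard error is treated as known, so the 95% interval is the estimate +/- 1.96 standard errors. The true effect of 30 mg is hypothetical (it doesn’t affect the capture rate), and the function names are invented for this example.

```python
import random
from math import sqrt
from statistics import NormalDist

# Critical value for a 95% interval under the normal assumption (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)


def analytic_capture_rate():
    """Chance that a replication mean lands inside the original 95% CI.

    The difference between two independent sample means has a standard
    deviation of sqrt(2) * SE, while the CI half-width is Z_95 * SE.
    """
    return 2 * NormalDist().cdf(Z_95 / sqrt(2)) - 1


def simulated_capture_rate(true_effect=30.0, margin=5.0,
                           n_trials=100_000, seed=0):
    """Simulate pairs of studies and count how often the replication
    estimate falls within the original estimate +/- margin."""
    rng = random.Random(seed)
    se = margin / Z_95  # standard error implied by a 95% margin of error
    captured = 0
    for _ in range(n_trials):
        original = rng.gauss(true_effect, se)
        replication = rng.gauss(true_effect, se)
        if abs(replication - original) <= margin:
            captured += 1
    return captured / n_trials


if __name__ == "__main__":
    print(f"analytic:  {analytic_capture_rate():.3f}")
    print(f"simulated: {simulated_capture_rate():.3f}")
```

The reason it comes out near 83% rather than 95% is that the original estimate is itself noisy: the interval is centered on one sample’s result, not on the true value, so the replication has two sources of sampling error to overcome.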
He goes into a lot more about meta-analysis and effect sizes, but I don’t really want to write any more since this post is already pretty long. Maybe I’ll continue it in another! Thanks for reading.
2 thoughts on ““New” Statistics and Research Integrity”
Hi, I really liked your post and it reflects very much what I’ve been thinking about hypothesis testing and p-values. But to disregard the p-value altogether seems to me to throw the baby out with the bathwater.
A p-value can be a valuable tool to guard against “making a fool of yourself” (which, according to David Colquhoun, is the main purpose of statistics). Indeed, if you proceed to set up a straw man (a null hypothesis that you already know to be unlikely to be true) only to knock it down (p < 0.05), then you gain very little. But in exploratory analysis, this can be quite helpful to prevent you from going down the wrong path: you run an experiment, you find an apparent effect, and you are excited, but then you calculate p = 0.25 (say) and abandon your hypothesis on these grounds.
In other words, I think that hypothesis testing can be a first step or rather one of many steps that one could and should take when exploring a hypothesis, each of which bolsters (or deflates) our confidence that there might be something to it.
This would also get rid of the dichotomy that hypothesis testing is a frequentist procedure with a Bayesian aim: we want to know P(Hypothesis | Data) but have P(Data | Hypothesis).
Thanks for the reply, Stephan, I think you make some good points.
/But in exploratory analysis, this can be quite helpful to prevent you from going down the wrong path/
So you think that a p-value can reliably be used as a sanity check, rather than the main hinge of the analysis. I can agree with that. I think that the importance of the result shouldn’t depend solely on the p-value, but I can see why we would want to use it as a sort of corroborative evidence to see if we have converging evidence towards our hypothesis.