In about a week I’ll be leading a journal club discussion on this paper, “The New Statistics: Why and How“. I think it would behoove me to do a quick run through the paper before we get to the seminar table so I don’t get lost.
The main focus points from the paper:
1. We should promote research integrity by addressing three problems.
First, we should make sure all research results are published (if not in a big name journal then at least in an online depository). If only big, exciting findings make it into journals we’ll have a biased set of results which leave us misinformed. Second, we need to avoid bias in data selection. Namely, we need to denote which results we predicted ahead of time and which we found after peeking at the data.The reason this is a problem is that many times the distinction isn’t made between prespecified and post-hoc analyses, allowing researchers to report results that might simply be lucky patterns. Third, we should do more replication and always report it. If all the “failed” replications never get reported, it seems reasonable to assume that some of the published literature has overestimated the size and reliability of their results. If we make a go at the same procedure and find much smaller (or larger!) effects, by reporting the results we paint a more realistic picture of the findings.
2. We should switch our thinking from, “I must find statistically significant results,” to “I should try to be as precise in my estimate as possible.”
The best way, says Cumming, is to move entirely away from the thinking involved in trying to deny a null-hypothesis (typically the opposite of what we really want to claim) that we never actually believed in the first place, and most certainly is known to be false from the outset. For example, if we want to show that have men higher levels of testosterone than women and find men avg. 80mg vs women avg. 50mg in a blood sample, we wouldn’t test the likelihood of our actual hypothesis. We would first need to set up a hypothesis we want to disprove- that men and women are not different in testosterone levels, then we would calculate the chance of finding data as extreme or more extreme as the ones we found. In other words, we instead have to ask “What is the chance that we would find a result as extreme or more extreme as we found, if we assume they actually don’t differ at all?” Kinda hard to wrap your head around right?
That’s what a p-value describes. So if we find there is only a 1% chance of finding data as extreme as ours in a hypothetical world where there is no real difference, then we say “our result is statistically significant, p <.05, so we reject the null-hypothesis that men and women have equal testosterone levels.” Note that this doesn’t actually tell us anything about the likelihood of our hypothesis– namely, that men have higher levels. It only tells us the likelihood of finding our data if we assume that there is no difference between men and women. It also doesn’t say anything about how big the difference is between men and women. This method is confusing because it relies on calculations that take into account things we don’t actually observe or think.
Cumming’s answer is to do away with p and simply report how big our effects are and then convey how precise our measurements are in the form of a confidence interval, usually set at 95%. Back to testosterone, if we found that men had 30 +/- 5 mg higher testosterone than women, then that statement conveys both the size of our effect (30 mg) and the amount of uncertainty we have about the data (it could be off by 5 mg in either direction). Cumming thinks that this method is superior in every way to the significance tests that are so popular because it reports more information in an easier to digest format. It also lends itself to more reliable estimates of how our results would turn out if we conducted the experiment again. Replication of an original study with a p of .05 can range from <.001 to about . 40. That’s a huge range! You could get lucky and find very strong evidence or you can be unlucky and never have a chance in hell of getting results published. Alternatively, if the original report estimated an effect of 30 +/- 5 then there is an 83% chance that a replication study will find a value between 25 and 35. That does seem a lot more informative than p.
He goes into a lot more about meta-analysis and effect sizes, but I don’t really want to write anymore since this post is pretty long. Maybe I’ll continue it in another! Thanks for reading.