Practice Makes Perfect (p<.05)

What’s wrong with [null-hypothesis significance testing]? Well… it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! (Cohen 1994, pg 997)

That quote was written by Jacob Cohen in 1994.What does it mean? Let’s start from the top.

A null-hypothesis significance test (NHST) is a statistical test in which one wishes to test a research hypothesis. For example, say I hypothesize that practicing  improves performance (makes you faster) when building a specific lego set. So I go out and collect some data to see how much people improve on average from a pretest to a post test- one group with no practice (control group) and another group with practice (experimental group). I end up finding that people improve by five minutes when they practice and they don’t improve when they don’t practice. That seems to support my hypothesis that practice leads to improvement!


Typically, however, in my field (psychology) one does not simply test their research hypothesis directly, first one sets up a null-hypothesis (i.e., H0, typically the opposite of their real hypothesis: e.g., no effect, no difference between means, etc.) and collects data trying to show that the null-hypothesis isn’t true. To test my hypothesis using NHST, I would first have to imagine that I’m in a fictitious world where practicing on this measure doesn’t actually improve performance (H0 = no difference in improvement between groups). Then I calculate the likelihood of finding results at least as extreme as the ones i found. If the chance of finding results at least as extreme as mine is less than 5%, we reject the null-hypothesis and say it is unlikely to be true.

In other words, I calculate the probability of finding a difference of improvement between groups of at least 5 minutes on my lego building task- remember, in a world where practicing doesn’t make you better and the groups improvements aren’t different- and I find that my probability (p-value) is 1%. Wow! That’s pretty good. Definitely less than 5% so I can reject the null-hypothesis of no improvement when people practice.

But what do I really learn from a significance test? A p-value only tells me the chance that I should find data like mine in a hypothetical world, a world that I don’t think is true, and I don’t want to be true. Then when I find data that seem unlikely in a world where H0 is true, I conclude that it likely isn’t true. The logic of the argument is thus:

If H0 is true, then this result (statistical significance) would probably not occur.

This result has occurred.

Then H0 is probably not true [….] (Cohen, 1994 pg 998)

So: if it’s unlikely to find data like mine in a world where H0 is true, then it is unlikely that the null-hypothesis is true. We want to say is how likely our null-hypothesis is by looking at our data.  That’s inverse reasoning though. We don’t have any information about the likelihood of H0, we just did an experiment where we pretended that it was true! How can our results from a world in which H0 is true provide evidence that it isn’t true? It’s already assumed to be true in our calculations! We only make the decision to reject H0 because one day we arbitrarily decided that our cut-off was 5%, and anything smaller than that means we don’t believe H0 true.

Maybe this will make it more clear why that reasoning is bad:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)

This person is a member of Congress.

Therefore, he is probably not an American. (ibid)

That’s the same logical structure that the null-hypothesis test takes. Obviously incoherent when we put it like that right?

This problem arises because we want to say “it is unlikely that the null-hypothesis is true,” but what we really say with a p-value is, “it is unlikely to find this extreme of data when the null-hypothesis is true.” Those are very different statements. One gives a likelihood of a hypothesis given a data set, P( Hypothesis | Data) and the other gives a likelihood of data given a hypothesis, P( Data | Hypothesis). No matter how much we wish for it to be true, the two probabilities are not the same. They’re never going to be the same. P-values will never tell us what we want them to tell us. We should stop pretending they do and we should acknowledge the limited inferential ability of our NHST.


Thanks for reading, comment if you’d like.


Using journal rank as an assessment tool- we probably shouldn’t do it

This is my summary of Brembs, Button, and Munafo (2013), “Deep impact: unintended consequences of journal rank.” Main points I took from the paper: 1) Some journals get their “impact factor” through shady means. 2) How does journal rank relate to reliability of results and rate of retractions? 3) Do higher ranking journals publish “better” findings? 4) What should we do if we think journal rank is a bunk measure?

1) How do journals get their impact factor (IF) rank? It’s an account of the number of citations that publications in that journal get per the amount of articles in the journal- and a higher impact factor is seen as more prestigious. Apparently some journals are negotiating their IF and inflating it artificially. There is quite a bit of evidence that some journals inflate their ranking by changing what kinds of articles count for their IF, such as excluding opinion pieces and news editorials. Naturally, if you reduce how many articles count towards the IF but keep the number of citations constant, there will be a stronger ratio of number of citations to number of articles. It gets worse though, as a group of researchers purchased the data from journals in an attempt to manually calculate their impact factor, and are sometimes off by up to 19% of what the journal claims! So even if you know all the info about citations and articles in a journal, you still can’t figure out their IF. Seems kinda fishy.

2) Brembs and colleagues looked at the relation a journal’s rank had on both retraction rates and decline effects. Rate of retractions in the scientific literature have gone from up drastically recently, and now the majority of all retractions are due to scientific misconduct, purposeful or otherwise. They found a strong correlation between a journal’s impact factor and retraction rate (figure 1d):Image

As we can see, as a journal’s impact factor rises so too does it’s rate of retractions. Why this happens is likely a mixture of social pressures- the push for publishing in high journals increases unreliability of findings and higher visibility of papers. If more people see your paper, there is a better chance someone is going to catch you out. A popular case right now is the retraction of a publication in Nature of a novel acid bath procedure that can create certain types of stem cells. It went through 9 months of peer-review, and yet it only took a handful of weeks for it to be retracted once everyone else got their turn at it. It turns out that one of the authors was reproducing figures and results from other work they had done in the past that didn’t get much press.

The decline effect is an observation that some initially strong reported effects (say a drug’s ability to treat cancer) can gradually decline as more studies are done, such that the initial finding is seen as a gross overestimate- and the real effect is estimated to be quite small or even zero. Here I’ve reproduced figure 1b from Brembs et al., showing a plot of the decline of the reported association between carrying a certain gene and your likelihood to succumb to alcoholism. The size of the bubbles indicates the relative journal impact factor and the higher on the y-axis the bubble is, the stronger the reported association. Clearly, as more data come in (from the lower impact journals) there is less and less evidence that the association is as strong as initially reported in the high impact journals. Image

So what should we take from this? Clearly there are higher rates of retractions in high impact journals. Additionally, some initial estimates reported in high impact journals lend themselves to a steep decline in their evidential value as smaller impact journals report consistently smaller effects as time goes on. Unfortunately, once the media gets hold of the big initial findings from prominent journals it’s unlikely the smaller estimates from less known journals get anywhere near the same press.

3) There is a perception that higher ranking journals publish more important science. There is a bit of evidence showing that a publication’s perceived importance is tied to it’s publishing journal’s impact factor, and experts rank papers from high impact journals as more important.* However, further investigation shows that journal ranking only accounts for a small amount of a paper’s number of citations (R² = .1 to .3). In other words, publishing in a high impact journal confers a small benefit on the number of citations a paper garners, likely due more to the effects high impact journals have on reading habits than due to the higher quality of the publications.

4) Brembs et al recommend that we stop using journal rank as an assessment tool, and instead “[bring] scholarly communication back to the research institutions … in which both software, raw data and their text descriptions are archived and made accessible (pg 8).” They want us to move away from closed publication that costs up to $2.8 billion annually to a more open evaluation system.

Overall I think they make a strong case that the commonly held assumptions about journal rank are misguided, and we would should be advocating for a more open reporting system. Clearly the pressures of the “publish-or-perish” culture in academia right now are making otherwise good people do shady things (and making it easier for shady people to get away with what they’d do anyways). That’s not to say the people involved aren’t responsible, but there is definitely a culture that encourages subpar methods and behavior. The first step is creating an environment in which people are comfortable publishing “small” effects and where we encourage replication and combination across multiple findings before we make any claims with relative certainty.

*However, in that study they didn’t mask the name of the journal that the papers were published in, so there could be confounding subjective valuations from the experts on the paper’s perceived importance.

“New” Statistics and Research Integrity

In about a week I’ll be leading a journal club discussion on this paper, “The New Statistics: Why and How“. I think it would behoove me to do a quick run through the paper before we get to the seminar table so I don’t get lost.

The main focus points from the paper:

1. We should promote research integrity by addressing three problems.

First, we should make sure all research results are published (if not in a big name journal then at least in an online depository). If only big, exciting findings make it into journals we’ll have a biased set of results which leave us misinformed. Second, we need to avoid bias in data selection. Namely, we need to denote which results we predicted ahead of time and which we found after peeking at the data.The reason this is a problem is that many times the distinction isn’t made between prespecified and post-hoc analyses, allowing researchers to report results that might simply be lucky patterns. Third, we should do more replication and always report it. If all the “failed” replications never get reported, it seems reasonable to assume that some of the published literature has overestimated the size and reliability of their results. If we make a go at the same procedure and find much smaller (or larger!) effects, by reporting the results we paint a more realistic picture of the findings.

2. We should switch our thinking from, “I must find statistically significant results,” to “I should try to be as precise in my estimate as possible.”

The best way, says Cumming, is to move entirely away from the thinking involved in trying to deny a null-hypothesis (typically the opposite of what we really want to claim) that we never actually believed in the first place, and most certainly is known to be false from the outset. For example, if we want to show that have men higher levels of testosterone than women and find men avg. 80mg vs women avg. 50mg in a blood sample, we wouldn’t test the likelihood of our actual hypothesis. We would first need to set up a hypothesis we want to disprove- that men and women are not different in testosterone levels, then we would calculate the chance of finding data as extreme or more extreme as the ones we found. In other words, we instead have to ask “What is the chance that we would find a result as extreme or more extreme as we found, if we assume they actually don’t differ at all?” Kinda hard to wrap your head around right?

That’s what a p-value describes. So if we find there is only a 1% chance of finding data as extreme as ours in a hypothetical world where there is no real difference, then we say “our result is statistically significant, p <.05, so we reject the null-hypothesis that men and women have equal testosterone levels.” Note that this doesn’t actually tell us anything about the likelihood of our hypothesis– namely, that men have higher levels. It only tells us the likelihood of finding our data if we assume that there is no difference between men and women. It also doesn’t say anything about how big the difference is between men and women. This method is confusing because it relies on calculations that take into account things we don’t actually observe or think.

Cumming’s answer is to do away with p and simply report how big our effects are and then convey how precise our measurements are in the form of a confidence interval, usually set at 95%. Back to testosterone, if we found that men had 30 +/- 5 mg higher testosterone than women, then that statement conveys both the size of our effect (30 mg) and the amount of uncertainty we have about the data  (it could be off by 5 mg in either direction). Cumming thinks that this method is superior in every way to the significance tests that are so popular because it reports more information in an easier to digest format. It also lends itself to more reliable estimates of how our results would turn out if we conducted the experiment again. Replication of an original study with a p of .05 can range from <.001 to about . 40. That’s a huge range! You could get lucky and find very strong evidence or you can be unlucky and never have a chance in hell of getting results published. Alternatively, if the original report estimated an effect of 30 +/- 5 then there is an 83% chance that a replication study will find a value between 25 and 35. That does seem a lot more informative than p.

He goes into a lot more about meta-analysis and effect sizes, but I don’t really want to write anymore since this post is pretty long. Maybe I’ll continue it in another! Thanks for reading.

First post at Cohen’s b: What am I doing here?

This blog is a record of what I’ve been reading and doing over the week. It might form a coherent piece or it might be a random garble. Sometimes I’ll post on just one piece that I feel deserves a whole post to itself.

What fun stuff am I reading this week?

Stats and psychology:

Biology fun:


  • My friend Duncan just joined the Peace Corps and now he is in Jamaica (I wrote him a recommendation!). He’s apparently falling in love with goats…