Edwards, Lindman, and Savage (1963) on why the p-value is still so dominant

Below is an excerpt from Edwards, Lindman, and Savage (1963, pp. 236-7) on why p-value procedures continue to dominate the empirical sciences even after the p-value has been repeatedly shown to be an incoherent and nonsensical statistic (note: those are my choice of words; the authors are far more cordial in their commentary). The age of the article shows in points 1 and 2, but I think the commentary is still valuable; points 3 and 4 remain highly relevant today.

From Edwards, Lindman, and Savage (1963, pp. 236-7):

If classical significance tests have rather frequently rejected true null hypotheses without real evidence, why have they survived so long and so dominated certain empirical sciences? Four remarks seem to shed some light on this important and difficult question.

1. In principle, many of the rejections at the .05 level are based on values of the test statistic far beyond the borderline, and so correspond to almost unequivocal evidence [i.e., passing the interocular trauma test]. In practice, this argument loses much of its force. It has become customary to reject a null hypothesis at the highest significance level among the magic values, .05, .01, and .001, which the test statistic permits, rather than to choose a significance level in advance and reject all hypotheses whose test statistics fall beyond the criterion value specified by the chosen significance level. So a .05 level rejection today usually means that the test statistic was significant at the .05 level but not at the .01 level. Still, a test statistic which falls just short of the .01 level may correspond to much stronger evidence against a null hypothesis than one barely significant at the .05 level. …

2. Important rejections at the .05 or .01 levels based on test statistics which would not have been significant at higher levels are not common. Psychologists tend to run relatively large experiments, and to get very highly significant main effects. The place where .05 level rejections are most common is in testing interactions in analyses of variance—and few experimenters take those tests very seriously, unless several lines of evidence point to the same conclusions. [emphasis added]

3. Attempts to replicate a result are rather rare, so few null hypothesis rejections are subjected to an empirical check. When such a check is performed and fails, explanation of the anomaly almost always centers on experimental design, minor variations in technique, and so forth, rather than on the meaning of the statistical procedures used in the original study.

4. Classical procedures sometimes test null hypotheses that no one would believe for a moment, no matter what the data […] Testing an unbelievable null hypothesis amounts, in practice, to assigning an unreasonably large prior probability to a very small region of possible values of the true parameter. […] The frequent reluctance of empirical scientists to accept null hypotheses which their data do not classically reject suggests their appropriate skepticism about the original plausibility of these null hypotheses. [emphasis added]

 

References

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193–242.

The Special One-Way ANOVA (or, Shutting up Reviewer #2)

The One-Way Analysis of Variance (ANOVA) is a handy procedure commonly used when a researcher has three or more groups to compare. If the test comes up significant, follow-up tests are run to determine which groups show meaningful differences. These follow-up tests are usually corrected for multiple comparisons (the Bonferroni method is the most common in my experience) by dividing the nominal alpha (usually .05) by the number of tests. So if there are 5 follow-up tests, each comparison’s p-value must be below .01 to really “count” as significant. This reduces each test’s power considerably, but it guards better against false positives. It is common to correct all follow-up tests after a significant main effect, no matter the experimental design, but this is unnecessary when there are only three levels. H/T to Mike Aitken Deakin (here: @mrfaitkendeakin) and Chris Chambers (here: @chrisdc77) for sharing.
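To make the arithmetic concrete, here is a minimal sketch of the Bonferroni adjustment described above. The alpha level and the number of follow-up tests are just illustrative choices of mine, and the "independent tests" figure is only a rough picture of the inflation you would risk with no correction at all.

```python
# Minimal sketch of the Bonferroni arithmetic described above.
# The alpha level and number of follow-up tests are illustrative choices.

alpha = 0.05   # nominal family-wise alpha
n_tests = 5    # number of follow-up comparisons

# Bonferroni: each individual comparison is tested at alpha / n_tests
alpha_per_test = alpha / n_tests
print(f"Per-comparison alpha: {alpha_per_test:.3f}")  # 0.010

# For contrast: if all 5 tests were run uncorrected and were independent,
# the chance of at least one false positive would be 1 - (1 - alpha)^n
fwer_uncorrected = 1 - (1 - alpha) ** n_tests
print(f"Uncorrected FWER (independent tests): {fwer_uncorrected:.3f}")  # ~0.226
```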

The Logic of the Uncorrected Test

In the case of the One-Way ANOVA with three levels, it is not necessary to correct for the extra t-tests, because the experimental design ensures that the family-wise error rate cannot exceed 5%, so long as no follow-up tests are carried out when the overall ANOVA is not significant.

The family-wise error rate (FWER) is the probability of making at least 1 erroneous rejection of the null hypothesis in a set of tests. If we make 2, 3, or even 4 erroneous rejections, it isn’t considered any worse than 1. Whether or not this makes sense is a topic for another blog post. But taking this definition, we can think through the possible scenarios (outlined in Chris’s tweet) and see why no corrections are needed:

True relationship: µ1 = µ2 = µ3 (the null hypothesis is really true; all groups are equal). If the main effect is not significant, no follow-up tests are run and the FWER remains at 5%. (If you run follow-up tests at this point, you do need to correct for multiple comparisons.) If the main effect is significant, it does not matter what the follow-up tests show, because we have already committed our allotted false positive: we made the higher-order mistake of saying that some differences are present before we even examined the individual group contrasts. Again, the FWER only counts whether at least 1 erroneous rejection was made. So no matter what the follow-up tests show, the FWER stays at 5%, since the first false positive happened before the follow-ups were even conducted.

True relationship: µ1 ≠ µ2 = µ3, OR µ1 = µ2 ≠ µ3, OR µ1 = µ3 ≠ µ2 (the null hypothesis is really false; one group stands out). If the main effect is not significant, no follow-up tests are run and no false positive can occur. If the main effect is significant, then we are correct, and no false positive is possible at this level. We move on to the follow-up tests, where only one pair of means is truly equal, so that single pair is the only place a false-positive result can occur. Again, the FWER remains at 5% because we have only 1 opportunity to erroneously reject a null hypothesis.

True relationship: µ1 ≠ µ2 ≠ µ3 (all three groups are truly different). A false positive is impossible in this case because no pair of means is truly equal, so the follow-up tests necessarily keep the FWER at 0%!

There is no possible scenario where your FWER goes above 5%, so no need to correct for multiple comparisons! 
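If you would rather check the worst case than take the scenarios on faith, here is a quick Monte Carlo sketch of the protocol under the full null (all three population means equal). It uses numpy and scipy; the sample size, seed, and number of simulations are arbitrary choices of mine, not anything from Chris’s tweet.

```python
# Monte Carlo sketch of the "follow-up t-tests only after a significant
# one-way ANOVA" protocol under the full null (all population means equal).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_per_group, n_sims = 0.05, 30, 20_000

omnibus_rejections = 0       # ANOVA rejects even though all means are equal
any_pairwise_rejection = 0   # ...and at least one uncorrected t-test also rejects

for _ in range(n_sims):
    # Full-null scenario: all three groups drawn from the same population
    g1, g2, g3 = (rng.normal(0, 1, n_per_group) for _ in range(3))

    # Step 1: omnibus one-way ANOVA; stop here if it is not significant
    _, p_anova = stats.f_oneway(g1, g2, g3)
    if p_anova >= alpha:
        continue  # no follow-ups are run, so no false positive is possible

    omnibus_rejections += 1  # under the full null, this is already the erroneous rejection

    # Step 2: uncorrected pairwise t-tests, run only because the ANOVA was significant
    pairs = [(g1, g2), (g1, g3), (g2, g3)]
    if any(stats.ttest_ind(a, b).pvalue < alpha for a, b in pairs):
        any_pairwise_rejection += 1

print(f"P(at least one erroneous rejection): {omnibus_rejections / n_sims:.3f}")      # ~0.05
print(f"P(an uncorrected follow-up also rejects): {any_pairwise_rejection / n_sims:.3f}")
```

Because the uncorrected t-tests are only ever run after the omnibus test has already (erroneously) rejected, they cannot push the probability of at least one erroneous rejection above the 5% that the ANOVA itself spends.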

So the next time Reviewer #2 gives you a hard time about correcting for multiple comparisons on a One-Way ANOVA with three levels, you can rightfully defend your uncorrected t-tests. Not correcting the alpha saves you some power, thereby making it easier to support your interesting findings.

If you wanted to sidestep the multiple comparison problem altogether, you could do a fully Bayesian analysis, in which the number of tests conducted has no bearing on the evidence provided by any single test. In other words, you could jump straight to the comparisons of interest instead of going through the significant-main-effect → follow-up-test routine. Wouldn’t that save us all a lot of hassle?
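As one rough illustration of jumping straight to a comparison of interest, here is a sketch that computes an approximate Bayes factor for a single two-group difference using the BIC approximation (Wagenmakers, 2007). This is a stand-in of my own choosing, not the specific Bayesian analysis meant above; the helper name and the simulated data are placeholders.

```python
# Minimal sketch: approximate Bayes factor for one mean difference via the
# BIC approximation (Wagenmakers, 2007). Illustration only; data are simulated.
import numpy as np

def bf10_two_groups(x, y):
    """Approximate Bayes factor (alternative over null) for a two-group mean difference."""
    data = np.concatenate([x, y])
    n = data.size

    # Null model: one common mean. Alternative model: separate group means.
    sse_null = np.sum((data - data.mean()) ** 2)
    sse_alt = np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)

    # BIC up to a constant that cancels in the difference:
    #   BIC = n * ln(SSE / n) + k * ln(n)
    bic_null = n * np.log(sse_null / n) + 1 * np.log(n)
    bic_alt = n * np.log(sse_alt / n) + 2 * np.log(n)

    # BF01 ~ exp((BIC_alt - BIC_null) / 2), so BF10 is the inverse
    return np.exp((bic_null - bic_alt) / 2)

rng = np.random.default_rng(7)
group_a = rng.normal(0.0, 1.0, 30)
group_b = rng.normal(0.6, 1.0, 30)
print(f"BF10 ~ {bf10_two_groups(group_a, group_b):.2f}")  # >1 favors a group difference
```

The point is simply that this number depends only on the data and models for this one comparison; running (or not running) other comparisons doesn’t change it.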

 

Lack of Power (and not the statistical kind)

One thing that never really comes up when people talk about “Questionable Research Practices” is what to do when you’re a junior in the field and someone senior to you suggests that you partake. [snip] It can be daunting to be the only one who thinks we shouldn’t drop 2 outliers to get our p-value from .08 to .01, or who thinks we shouldn’t go collect 5 more subjects to make it “work.” When it is 1 vs 4 and you’re at the bottom of the totem pole, it rarely works out the way you want. It is hard not to get defensive, and you desperately want everyone to just come around to your thinking, but it doesn’t happen. What can the little guy say to the behemoths staring him down?

I’ve recently been put in this situation, and I am finding it to be a challenge that I don’t know how to overcome. It is difficult to explain to someone that what they are suggesting you do is [questionable] without sounding accusatory. I can explain the problems with letting our post hoc p-value guide interpretation, or the problems for replicability when the analysis plan isn’t predetermined, or the problems with cherry-picking outliers, but it’s really an ethical issue at its core. I don’t want to engage in what I know is a [questionable] practice, but I don’t have a choice. I can’t afford to burn bridges when those same bridges are the only things that get me over the water and into a job.

I’ve realized that this amazing movement in the field of psychology has left me feeling somewhat helpless. When push comes to shove, the one running the lab wins and I have to yield, even against my better judgment. After five months of data collection, am I supposed to just step away and not put my name on the work? There’s something to that, I suppose. A bit of poetic justice. But justice doesn’t get you into grad school, or get you a PhD, or get you a faculty job, or get you a grant, or get you tenure. The pressure is real for the ones at the bottom. I think more attention needs to be paid to this aspect of the psychology movement. I can’t be the only one who feels like I know what I should (and shouldn’t) be doing but don’t have a choice.

Edit: See another great point of view on this issue here http://jonathanramsay.com/questionable-research-practices-the-grad-student-perspective/

Edit 3: Changed some language

Using journal rank as an assessment tool: we probably shouldn’t do it

This is my summary of Brembs, Button, and Munafo (2013), “Deep impact: unintended consequences of journal rank.” Main points I took from the paper: 1) Some journals get their “impact factor” through shady means. 2) How does journal rank relate to the reliability of results and the rate of retractions? 3) Do higher-ranking journals publish “better” findings? 4) What should we do if we think journal rank is a bunk measure?

1) How do journals get their impact factor (IF) rank? The IF is essentially a ratio: the number of citations that publications in the journal receive divided by the number of articles the journal published, and a higher impact factor is seen as more prestigious. Apparently some journals negotiate their IF and inflate it artificially. There is quite a bit of evidence that some journals inflate their ranking by changing what kinds of articles count toward the IF, such as excluding opinion pieces and news editorials. Naturally, if you shrink the number of articles that count toward the IF but keep the number of citations constant, the ratio of citations to articles goes up. It gets worse, though: when a group of researchers purchased the data from journals and tried to calculate the impact factors themselves, their figures were sometimes off by up to 19% from what the journals claim! So even if you know all the info about citations and articles in a journal, you still can’t reproduce its IF. Seems kinda fishy.
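Just to make the denominator game concrete, here is a toy calculation; the citation and article counts below are made up purely for illustration.

```python
# Toy arithmetic for the denominator game described above.
# All counts are invented for illustration only.
citations = 1000            # citations received by the journal's content
articles_published = 500    # everything the journal printed, including editorials
front_matter = 100          # editorials / news pieces argued out of the denominator

if_counting_everything = citations / articles_published
if_trimmed_denominator = citations / (articles_published - front_matter)

print(f"IF counting everything:        {if_counting_everything:.2f}")   # 2.00
print(f"IF with a trimmed denominator: {if_trimmed_denominator:.2f}")   # 2.50
```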

2) Brembs and colleagues looked at the relation between a journal’s rank and both retraction rates and decline effects. The rate of retractions in the scientific literature has gone up drastically in recent years, and the majority of retractions are now due to scientific misconduct, purposeful or otherwise. They found a strong correlation between a journal’s impact factor and its retraction rate (their figure 1d):

[Figure 1d from Brembs et al. (2013): retraction rate plotted against journal impact factor]

As we can see, as a journal’s impact factor rises, so too does its rate of retractions. Why this happens is likely a mixture of social pressures: the push to publish in high-ranking journals increases the unreliability of findings, and the higher visibility of those papers means problems are more likely to be noticed. If more people see your paper, there is a better chance someone is going to catch you out. A prominent case right now is the retraction of a Nature publication reporting a novel acid-bath procedure said to create certain types of stem cells. It went through 9 months of peer review, and yet it took only a handful of weeks to be retracted once everyone else got their turn at it. It turns out that one of the authors was reproducing figures and results from other work they had done in the past that didn’t get much press.

The decline effect is the observation that some initially strong reported effects (say, a drug’s ability to treat cancer) gradually shrink as more studies are done, to the point that the initial finding looks like a gross overestimate and the real effect is estimated to be quite small or even zero. Here I’ve reproduced figure 1b from Brembs et al., showing the decline of the reported association between carrying a certain gene and the likelihood of succumbing to alcoholism. The size of each bubble indicates the relative journal impact factor, and the higher a bubble sits on the y-axis, the stronger the reported association. Clearly, as more data come in (from the lower-impact journals), there is less and less evidence that the association is as strong as initially reported in the high-impact journals.

[Figure 1b from Brembs et al. (2013): reported strength of the gene-alcoholism association over time, with bubble size proportional to journal impact factor]

So what should we take from this? Clearly there are higher rates of retraction in high-impact journals. Additionally, some initial estimates reported in high-impact journals undergo a steep decline in their evidential value as lower-impact journals report consistently smaller effects over time. Unfortunately, once the media get hold of the big initial findings from prominent journals, it’s unlikely that the smaller estimates from lesser-known journals get anywhere near the same press.

3) There is a perception that higher-ranking journals publish more important science. There is some evidence that a publication’s perceived importance is tied to its journal’s impact factor: experts rank papers from high-impact journals as more important.* However, further investigation shows that journal rank accounts for only a small share of the variance in a paper’s citation count (R² = .1 to .3). In other words, publishing in a high-impact journal confers a small benefit on the number of citations a paper garners, likely due more to the effect high-impact journals have on reading habits than to any higher quality of the publications.

4) Brembs et al. recommend that we stop using journal rank as an assessment tool and instead “[bring] scholarly communication back to the research institutions … in which both software, raw data and their text descriptions are archived and made accessible” (p. 8). They want us to move away from a closed publication system that costs up to $2.8 billion annually toward a more open evaluation system.

Overall I think they make a strong case that the commonly held assumptions about journal rank are misguided, and that we should be advocating for a more open reporting system. Clearly the pressures of the current “publish-or-perish” culture in academia are making otherwise good people do shady things (and making it easier for shady people to get away with what they’d do anyway). That’s not to say the people involved aren’t responsible, but there is definitely a culture that encourages subpar methods and behavior. The first step is creating an environment in which people are comfortable publishing “small” effects, and in which we encourage replication and the combination of multiple findings before we make any claims with relative certainty.

*However, in that study the journal names were not masked, so the experts’ judgments of a paper’s importance could have been confounded by the reputation of the journal it appeared in.