The general public has no idea what “statistically significant” means

The title of this piece shouldn’t shock anyone who has taken an introductory statistics course. Statistics is full of terms that have a specific statistical meaning apart from their everyday meaning. A few examples:

Significant, confidence, power, random, mean, normal, credible, moment, bias, interaction, likelihood, error, loadings, weights, hazard, risk, bootstrap, information, jack-knife, kernel, reliable, validity; and that’s just the tip of the iceberg. (Of course, one’s list gets bigger the more statistics courses one takes.)

It should come as no surprise that the general public mistakes a term’s statistical meaning for its everyday English meaning when nearly every word has some sort of dual meaning.

Philip Tromovitch (2015) has recently put out a neat paper in which he surveyed a little over 1,000 members of the general public on their understanding of the meaning of “significant,” a term which has a very precise statistical definition: assuming the null hypothesis is true (usually defined as no effect), discrepancies as large as or larger than this result would be so rare that, by acting as if the null hypothesis isn’t true, we won’t often be wrong.
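If you’d rather see that definition in action than in words, here is a minimal simulation sketch (my own toy example in Python, nothing from Tromovitch’s paper): we run many experiments in which the null hypothesis is true by construction and count how often a “significant” result at the .05 level shows up. The point is only about error control under the null; it says nothing about how big or important an effect is.

```python
# A minimal sketch of what "statistically significant" buys you, assuming a
# simple two-group comparison where the null hypothesis (no difference) is TRUE.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group, alpha = 10_000, 30, 0.05

false_alarms = 0
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # both groups drawn
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # from the same population
    _, p = stats.ttest_ind(a, b)
    if p < alpha:   # "significant": a discrepancy this large is rare under the null
        false_alarms += 1

# If the null were always true and we always acted on p < .05,
# we would be wrong only about 5% of the time.
print(f"Proportion of 'significant' results under a true null: {false_alarms / n_experiments:.3f}")
```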

However, in everyday English, something that is significant is noteworthy or worth our attention. Rather than give a clichéd dictionary definition, I asked my mother what she thought. She said she would interpret a phrase such as, “there was a significant drop in sales from 2013 to 2014,” to indicate that the drop in sales was “pretty big, like quite important.” (thanks mom 🙂 ) But that’s only one person. What did Tromovitch’s survey respondents think?

Tromovitch surveyed a total of 1,103 people. He asked 611 of his respondents to answer a multiple choice question, and the rest answered a variant as an open-ended question. Here is the multiple choice question he put to his respondents:

When scientists declare that the finding in their study is “significant,” which of the following do you suspect is closest to what they are saying:

  • the finding is large
  • the finding is important
  • the finding is different than would be expected by chance
  • the finding was unexpected
  • the finding is highly accurate
  • the finding is based on a large sample of data

Respondents choosing either of the first two options were considered to be incorrectly applying the general English meaning, choosing the third answer was considered correct, and choosing any of the final three counted as a different kind of incorrect answer. He separated general public responses from those with doctorate degrees (n=15), but he didn’t get any information on what topic their degrees were in, so I’ll just refer to the rest of the sample’s results from here on, since the doctorate sample should really be taken with a grain of salt.

Roughly 50% of respondents gave a general English interpretation of the “significant” results (options 1 or 2), roughly 40% chose one of the other three wrong responses (options 4, 5, or 6), and fewer than 10% actually chose the correct answer (option 3). Even if they were totally guessing you’d expect them to get close to 17% correct (1 in 6), give or take.

But perhaps multiple choice format isn’t the best way to get at this, since the prompt itself provides many answers that sound perfectly reasonable. Tromovitch also asked this as an open-ended question to see what kind of responses people would generate themselves. One variant of the prompt explicitly mentions that he wants to know about statistical significance, while the other simply mentions significance. The exact wording was this:

Scientists sometimes conclude that the finding in their study is “[statistically] significant.” If you were updating a dictionary of modern American English, how would you define the term “[statistically] significant”?

Did respondents do any better when they could answer freely? Not at all. Neither prompt had a very high success rate; correct response rates were roughly 4% and 1%. This translates to literally 12 correct answers out of the total 492 respondents to both prompts combined (including PhD responses). Tromovitch includes all of these responses in the appendix so you can read the kinds of answers that were given and considered to be correct.

If you take a look at the responses you’ll see that most of them imply some statement about the probability of one hypothesis or the other being true, which isn’t allowed by the correct definition of statistical significance! For example, one answer coded as correct, “The likelihood that the result/findings are not due to chance and probably true,” is blatantly incorrect. The probability that the results are not due to chance is not what statistical significance tells you at all. Most of the responses coded as “correct” by Tromovitch are quite vague, so it’s not clear that even those correct responders have a good handle on the concept. No wonder the general public looks at statistics as if they’re some hand-wavy magic. They don’t get it at all.


My takeaway from this study is the title of this piece: the general public has no idea what statistical significance means. That’s not surprising when you consider that researchers themselves often don’t know what it means! Even professors who teach research methods and statistics get this wrong. Results from Haller & Krauss (2002), building off of Oakes (1986), suggest that it is normal for students, academic researchers, and even methodology instructors to endorse incorrect interpretations of p-values and significance tests. That’s pretty bad. It’s one thing for first-year students or the lay public to be confused, but educated academics and methodology instructors too? If you don’t buy the survey results, open up any issue of any psychology journal and you’ll find tons of examples of misinterpretation and confusion.

Recently Hoekstra, Morey, Rouder, & Wagenmakers (2014) demonstrated that confidence intervals are similarly misinterpreted by researchers, despite recent calls (Cumming, 2014) to abandon significance tests entirely in favor of confidence intervals. Perhaps we could toss out the whole lot and start over with something that actually makes sense? Maybe we could try teaching something that people can actually understand?

I’ve heard of this cool thing called Bayesian statistics we could try.

 

References

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1-20.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Wiley.

Tromovitch, P. (2015). The lay public’s misinterpretation of the meaning of ‘significant’: A call for simple yet significant changes in scientific reporting. Journal of Research Practice, 11(1), Article P1.

Should we buy what Greg Francis is selling? (Nope)

If you polled 100 scientists at your next conference with the single question, “Is there publication bias in your field?” I would predict that nearly 100% of respondents would reply “Yes.” How do they know? Did they need to read a thorough investigation of many journals to come to that conclusion? No, they know because they have all experienced publication bias firsthand.

Until recently, researchers had scant opportunity to publish experiments that didn’t “work” (and most times they still can’t, but now at least they can share them online unpublished). Anyone who has tried to publish a paper in which the main findings were not “significant,” or who has had a reviewer ask them to collect more subjects in order to lower their p-value (a big no-no), or who has neglected to submit to a conference because the results were null, or who has seen colleagues tweak and re-run experiments that failed to reach significance only to stop when one does, knows publication bias exists. They know that if they don’t have a uniformly “positive” result then it won’t be taken seriously. The basic reality is this: If you do research in any serious capacity, you have experienced (and probably contributed to) publication bias in your field.

Greg Francis thinks that we should be able to point out certain research topics or journals (that we already know to be biased toward positive results) and confirm that they are biased, using the Test of Excess Significance. This is a method developed by Ioannidis and Trikalinos (2007). The logic of the test is that of a traditional null-hypothesis test, and I’ll quote from Francis’s latest paper published in PLOS One (Francis et al., 2014):

We start by supposing proper data collection and analysis for each experiment along with full reporting of all experimental outcomes related to the theoretical ideas. Such suppositions are similar to the null hypothesis in standard hypothesis testing. We then identify the magnitude of the reported effects and estimate the probability of success for experiments like those reported. Finally, we compute a joint success probability Ptes, across the full set of experiments, which estimates the probability that experiments like the ones reported would produce outcomes at least as successful as those actually reported. … The Ptes value plays a role similar to the P value in standard hypothesis testing, with a small Ptes suggesting that the starting suppositions are not entirely correct and that, instead, there appears to be a problem with data collection, analysis, or publication of relevant findings. In essence, if Ptes is small, then the published findings … appear “too good to be true” (pg. 3).
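To make that logic concrete, here is a rough sketch of the computation for the simplest case I can think of: a set of two-sample t-tests that were all reported as significant. The effect sizes and sample sizes below are placeholders I made up, and this is just my reading of the Ioannidis and Trikalinos logic, not code from Francis’s papers.

```python
# Sketch of the Test of Excess Significance logic for a set of two-sample
# t-tests that were all reported as significant (two-sided, alpha = .05).
# Ptes = product of each experiment's estimated power, i.e., the probability
# that experiments like these would all "succeed" if run again.
import numpy as np
from scipy import stats

# Placeholder reported values (made up for illustration): Cohen's d and per-group n.
reported = [(0.45, 40), (0.50, 35), (0.40, 45), (0.55, 30)]

def estimated_power(d, n, alpha=0.05):
    """Power of a two-sided two-sample t-test, treating the reported d as the true effect."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)                  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

powers = [estimated_power(d, n) for d, n in reported]
p_tes = np.prod(powers)   # joint probability that all four would come out significant

print("Estimated power per experiment:", np.round(powers, 2))
print(f"Ptes = {p_tes:.3f}")   # if Ptes < .10, Francis would flag the set as "too good to be true"
```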

So it is a basic null-hypothesis significance test. I personally don’t see the point of this test, since we already know with certainty that the answer to the question, “Is there publication bias in this topic?” is unequivocally “Yes.” So every case that the test finds not to be biased is a false negative. But as Daniel Lakens said, “anyone is free to try anything in science,” a sentiment with which I agree wholeheartedly. And I would be a real hypocrite if I thought Francis shouldn’t share his new favorite method, even if it turns out it really doesn’t work very well. But if he is going to continue to apply this test and actually name authors who he thinks are engaging in specific questionable publishing practices, then he should at the very least include a “limitations of this method” section in every paper, wherein he at least cites his critics. He should also ask the original authors he is investigating for comments, since the original authors are the only ones who know the true state of their publication process. I am surprised that the reviewers and editor of this manuscript did not stop and ask themselves (or Francis), “It can’t be so cut and dried, can it?”

Why the Test of Excess Significance does not work

So on to the fun stuff. There are many reasons why this test cannot achieve its intended goals, and many reasons why we should take Francis’s claims with a grain of salt. This list is not arranged in order of importance, but in the order of the critics’ commentaries in the JMP special issue (excluding Ioannidis and Gelman because of space and relevance concerns). I selected the points that I think most clearly highlight the poor validity of this testing procedure. This list gets long, so you can skip to the Conclusions (tl;dr) below for a summary.

Vandekerckhove, Guan, & Styrcula, 2013

  1. Using Monte Carlo simulations, Vandekerckhove and colleagues show that when the test is used to censor studies that seem too good to be true in a 100% publication-biased environment, it censors almost nothing and the pooled effect size estimates remain as biased as before the correction (a toy version of this kind of simulation is sketched just after this list).
  2. Francis uses a conservative cutoff of .10 when he declares that a set of studies suffers from systematic bias. Vandekerckhove and colleagues simulate how estimates of pooled effect size change if the censoring is made more stringent by raising the cutoff to .80, so that only the most consistent-seeming sets survive. This has the counter-intuitive effect of increasing the bias in the pooled effect size estimate. In the words of the authors, “Perversely, censoring all but the most consistent-seeming papers … causes greater bias in the effect size estimate” (Italics original).
  3. Bottom line: This test cannot be used to adjust pooled effect size estimates by accounting for publication bias.
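Here is that toy version of the simulation (my own construction with arbitrary parameters, not the authors’ code). Every “published” study is significant by design, and the question is whether censoring the study sets flagged by the test moves the pooled effect size estimate back toward the truth; with settings like these it doesn’t get anywhere close.

```python
# Toy simulation of the Vandekerckhove-et-al. point (my own parameters, not theirs):
# with 100% publication bias, does censoring study sets flagged by the TES
# reduce the bias in the pooled effect size estimate? Compare the two means
# printed below against the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n, k, n_papers = 0.2, 20, 4, 500   # true effect, per-group n, studies per paper, papers
df = 2 * n - 2
t_crit = stats.t.ppf(0.975, df)

def published_study():
    """Re-run until significant in the predicted direction (100% publication bias)."""
    while True:
        a = rng.normal(true_d, 1, n)
        b = rng.normal(0, 1, n)
        t, p = stats.ttest_ind(a, b)
        if p < 0.05 and t > 0:
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            return (a.mean() - b.mean()) / pooled_sd   # observed Cohen's d

def p_tes(ds):
    """Product of estimated powers, treating each observed d as the true effect."""
    ncp = np.asarray(ds) * np.sqrt(n / 2)
    power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
    return power.prod()

papers = [[published_study() for _ in range(k)] for _ in range(n_papers)]
kept = [p for p in papers if p_tes(p) >= 0.10]   # censor "too good to be true" sets

print(f"True effect size:                 {true_d}")
print(f"Mean observed d, all papers:      {np.mean([np.mean(p) for p in papers]):.2f}")
print(f"Mean observed d, after censoring: {np.mean([np.mean(p) for p in kept]):.2f}")
print(f"Fraction of papers censored:      {1 - len(kept) / len(papers):.2f}")
```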

Simonsohn, 2013

  1. Francis acknowledges that there can be times when the test returns a significant result even when publication bias is small. Indeed, there is no way to distinguish between different amounts of publication bias by comparing different Ptes values (remember the rules of comparing p-values). Francis nevertheless argues that we should assume any significant Ptes result indicates an important level of publication bias. Repeat after me: statistically significant ≠ practically significant. The fact of the matter is, “the mere presence of publication bias does not imply it is consequential” and by extension “does not warrant fully ignoring the underlying data” (Italics original). Francis continues to ignore these facts. [As an aside: if he can come up with a way to quantify the amount of bias in an article (and not just state that bias is present), then maybe the method could be taken seriously.]
  2. Francis’s critiques themselves suffer from publication bias, invalidating the reported Ptes-values. While Francis believes this is not relevant because he is critiquing unrelated studies, they are related enough to be written up and published together. While the original topics may indeed be unrelated, “The critiques by Francis, by contrast, are by the same author, published in the same year, conducting the same statistical test, to examine the exact same question.” Hardly unrelated, it would seem.
  3.  If Francis can claim that his reported p-values are accurate because the underlying studies are unrelated, then so too can the original authors. Most reports with multiple studies test effects under different conditions or with different moderators. It goes both ways.

Johnson, 2013 (pdf hosted with permission of the author)

  1. Johnson begins by expressing how he feels being asked to comment on this method: “It is almost as if all parties involved are pretending that p-values reported in the psychological literature have some well-defined meaning and that our goal is to ferret out the few anomalies that have somehow misrepresented a type I error. Nothing, of course, could be farther from the truth.” The truth is this: “as normally reported, p-values and significance tests provide the consumer of these statistics absolutely no protection against rejecting ‘true’ null hypotheses at less than any specified rate smaller than 1.0. P-values … only provide the experimenter with such a protection … if she behaves in a scientifically principled way” (Italics added). So Johnson rejects the premise that the test of excess significance is evaluating a meaningful question at all.
  2. This test uses a nominal alpha of .10, more liberal than the conventional .05 used in most classic statistical tests. Francis’s simulations show, however, that (when assumptions are met and under ideal conditions) the actual type I error rate is far, far lower than the nominal level. This introduces questions of interpretability: How do we interpret the alpha level under different (non-ideal) conditions if the nominal alpha is not informative? Could we adjust it to reflect the actual alpha level? Probably not.
  3. This test is not straightforward to implement, and one must be knowledgeable about the research question in the paper being investigated and about which statistics are relevant to that question. Francis’s application to the Topolinski and Sparenberg (2012) article, for example, is fraught with possible researcher degrees of freedom regarding which test statistics he includes in his analysis.
  4. If researchers report multiple statistical tests based on the same underlying data, the assumption of independence is violated to an unknown degree, and the reported Ptes-values could range from barely altered at best, to completely invalidated at worst. Heterogeneity of statistical power for tests that are independent also invalidates the resulting Ptes-values, and his method has no way to account for power heterogeneity.
  5. There is no way to evaluate his sampling process, which is vital in evaluating any p-value (including Ptes). How did he come to analyze this paper, or this journal, or this research topic? How many did he look at before he decided to look at this particular one? Without this knowledge we cannot assess the validity of his reported Ptes-values.

Morey, 2013

  1. Bias is a property of a process, not of any individual sample. To see this, Morey asks us to imagine that we ask people to generate “random” sequences of 0s and 1s. We know that humans are biased when they do this, and typically alternate 0 and 1 too often. Say we have the sequence 011101000. This shows 4 alternations out of 8 transitions, exactly as many as we would expect from a random process (50%, or 4/8). If we know a human generated this sequence, then regardless of the fact that it conforms perfectly to a random sequence, it is still biased. Humans are biased regardless of the sequence they produce, and publication processes are biased regardless of the bias level in the studies they produce. Asking which journals or papers or topics show bias is asking the wrong question. We should ask whether the publication process is biased, the answer to which we already know is “Yes.” We should focus on changing the process, not singling out papers/topics/journals that we already know come from a biased process.
  2. The test assumes a fixed sample size (as does almost every p-value), but most researchers run studies sequentially. Most sets of studies are the result of getting a particular result, tweaking the protocol, getting another result, and repeating until satisfied or out of money/time. We know that p-values are not valid when the sample size is not fixed in advance, and this holds for Francis’s Ptes all the same. It is probably not possible to adjust the test to account for the sequential nature of real-world studies, although I would be interested to see a proof.
  3. The test equates violations of the binomial assumption with the presence of publication bias, which is just silly. Imagine we use the test in a scenario like the above (sequential testing) where we know the assumption is violated but we also know that all relevant experiments for this paper are published (say, we are the authors). We could reject the (irrelevant) null hypothesis even when we can be sure that the paper suffers from no publication bias. Further, through simulation Morey shows that when true power is .4 or less, “examining experiment sets of 5 or greater will always lead to a significant result [Ptes-value], even when there is no publication bias” (Italics original); a quick back-of-the-envelope illustration of this follows the list.
  4. Ptes suffers from all of the limitations of p-values, chief of which are that different p-values are not comparable and p is not an effect size (or a measure of evidence at all). Any criticisms of p-values and their interpretation (of which there are too many to list) apply to Ptes.
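Here is that back-of-the-envelope illustration of Morey’s third point (my own arithmetic, not his simulation): if each study truly has power .4 and a paper honestly reports five significant results, the joint success probability is already well below Francis’s .10 threshold.

```python
# Rough illustration of why low power sinks Ptes so quickly: five honestly
# reported significant results, each from a study with true power .4, are
# jointly "too good to be true" by the test's own threshold.
true_power, k, threshold = 0.4, 5, 0.10
p_tes = true_power ** k   # probability that all k studies come out significant
print(f"Ptes = {p_tes:.4f} (< {threshold}), flagged despite zero publication bias")
```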

Conclusions (tl;dr)

The test of excess significance suffers from many problems, ranging from answering the wrong questions about bias, to untenable assumptions, to poor performance in correcting effect size estimates for bias, to challenges in interpreting significant Ptes-values. Francis published a rejoinder in which he tries to address these concerns, but I find his rebuttal lacking. For space reasons (this is super long already) I won’t list the points in his reply, but I encourage you to read it if you are interested in this method. He disagrees with pretty much every point I’ve listed above, and often claims they are addressing the wrong questions. I contend that he falls into the same trap he warns others to avoid in his rejoinder, that is, “[the significance test can be] inappropriate because the data do not follow the assumptions of the analysis. … As many statisticians have emphasized, scientists need to look at their data and not just blindly apply significance tests.” I completely agree.

Edits: 12/7 correct mistake in Morey summary. 12/8 add links to reviewed commentaries.

References

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153-169.

Francis, G. (2013). We should focus on the biases that matter: A reply to commentaries. Journal of Mathematical Psychology, 57(5), 190-195.

Francis, G., Tanzman, J., & Matthews, W. J. (2014). Excess success for psychology articles in the journal Science. PLoS ONE, 9(12), e114255. doi:10.1371/journal.pone.0114255

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328-331.

Ioannidis, J. P., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245-253.

Johnson, V. E. (2013). On biases in assessing replicability, statistical consistency and publication bias. Journal of Mathematical Psychology, 57(5), 177-179.

Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180-183.

Simonsohn, U. (2013). It really just does not follow, comments on Francis (2013). Journal of Mathematical Psychology, 57(5), 174-176.

Vandekerckhove, J., Guan, M., & Styrcula, S. A. (2013). The consistency test may be too weak to be useful: Its systematic application would not improve effect size estimation in meta-analyses. Journal of Mathematical Psychology, 57(5), 170-173.

Using journal rank as an assessment tool: we probably shouldn’t do it

This is my summary of Brembs, Button, and Munafò (2013), “Deep impact: Unintended consequences of journal rank.” Main points I took from the paper: 1) Some journals get their “impact factor” through shady means. 2) How does journal rank relate to the reliability of results and the rate of retractions? 3) Do higher-ranking journals publish “better” findings? 4) What should we do if we think journal rank is a bunk measure?

1) How do journals get their impact factor (IF) rank? The IF is essentially the ratio of the number of citations that publications in a journal receive to the number of articles the journal publishes, and a higher impact factor is seen as more prestigious. Apparently some journals are negotiating their IF and inflating it artificially. There is quite a bit of evidence that some journals inflate their ranking by changing what kinds of articles count towards their IF, such as excluding opinion pieces and news editorials. Naturally, if you reduce how many articles count towards the IF but keep the number of citations constant, the ratio of citations to articles goes up. It gets worse, though: a group of researchers purchased the data from journals in an attempt to manually calculate their impact factors, and their calculations were sometimes off by up to 19% from what the journals claim! So even if you know all the info about citations and articles in a journal, you still can’t reproduce its IF. Seems kinda fishy.
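To see why shrinking the pool of countable articles inflates the ratio, here’s a toy calculation with made-up numbers (purely illustrative, not figures from the paper):

```python
# Toy illustration (made-up numbers): the impact factor is citations divided by
# "citable items", so shrinking the denominator inflates the ratio even though
# nothing about the journal's citations has changed.
citations = 3000                  # hypothetical citations to all content in the window
articles, editorials = 800, 200   # hypothetical counts of research articles and front matter

if_all_items = citations / (articles + editorials)   # count everything
if_articles_only = citations / articles              # exclude editorials/news from the denominator

print(f"IF counting all items:       {if_all_items:.2f}")       # 3.00
print(f"IF excluding 'front matter': {if_articles_only:.2f}")   # 3.75, a 25% jump
```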

2) Brembs and colleagues looked at the relation between a journal’s rank and both retraction rates and decline effects. Rates of retraction in the scientific literature have gone up drastically in recent years, and the majority of all retractions are now due to scientific misconduct, purposeful or otherwise. The authors found a strong correlation between a journal’s impact factor and its retraction rate (their figure 1d plots retraction rate against impact factor).

As we can see, as a journal’s impact factor rises so too does its rate of retractions. Why this happens is likely a mixture of social pressures: the push to publish in high-ranking journals increases the unreliability of findings, and those journals give papers higher visibility. If more people see your paper, there is a better chance someone is going to catch you out. A popular case right now is the retraction of a publication in Nature describing a novel acid-bath procedure that supposedly could create certain types of stem cells. It went through 9 months of peer review, and yet it took only a handful of weeks for it to be retracted once everyone else got their turn at it. It turns out that one of the authors was reproducing figures and results from other work they had done in the past that didn’t get much press.

The decline effect is the observation that some initially strong reported effects (say, a drug’s ability to treat cancer) can gradually decline as more studies are done, such that the initial finding comes to be seen as a gross overestimate and the real effect is estimated to be quite small or even zero. Figure 1b from Brembs et al. plots the decline of the reported association between carrying a certain gene and one’s likelihood of succumbing to alcoholism. The size of each bubble indicates the relative journal impact factor, and the higher the bubble sits on the y-axis, the stronger the reported association. Clearly, as more data come in (from the lower-impact journals) there is less and less evidence that the association is as strong as initially reported in the high-impact journals.

So what should we take from this? Clearly there are higher rates of retraction in high-impact journals. Additionally, some initial estimates reported in high-impact journals undergo a steep decline in their evidential value as lower-impact journals report consistently smaller effects over time. Unfortunately, once the media get hold of the big initial findings from prominent journals, it’s unlikely the smaller estimates from lesser-known journals will get anywhere near the same press.

3) There is a perception that higher-ranking journals publish more important science. There is a bit of evidence showing that a publication’s perceived importance is tied to its publishing journal’s impact factor, and experts rank papers from high-impact journals as more important.* However, further investigation shows that journal rank accounts for only a small amount of the variance in a paper’s citation count (R² = .1 to .3). In other words, publishing in a high-impact journal confers only a small benefit on the number of citations a paper garners, likely due more to the effect high-impact journals have on reading habits than to the higher quality of the publications.

4) Brembs et al. recommend that we stop using journal rank as an assessment tool, and instead “[bring] scholarly communication back to the research institutions … in which both software, raw data and their text descriptions are archived and made accessible” (pg. 8). They want us to move away from closed publication, which costs up to $2.8 billion annually, to a more open evaluation system.

Overall I think they make a strong case that the commonly held assumptions about journal rank are misguided, and we should be advocating for a more open reporting system. Clearly the pressures of the “publish-or-perish” culture in academia right now are making otherwise good people do shady things (and making it easier for shady people to get away with what they’d do anyways). That’s not to say the people involved aren’t responsible, but there is definitely a culture that encourages subpar methods and behavior. The first step is creating an environment in which people are comfortable publishing “small” effects and where we encourage replication and combining evidence across multiple findings before we make any claims with relative certainty.

*However, in that study the experts were not masked to the name of the journal each paper was published in, so their subjective valuations of a paper’s importance could have been confounded by knowing where it appeared.