The title of this piece shouldn’t shock anyone who has taken an introductory statistics course. Statistics is full of terms that have a specific statistical meaning apart from their everyday meaning. A few examples:

Significant, confidence, power, random, mean, normal, credible, moment, bias, interaction, likelihood, error, loadings, weights, hazard, risk, bootstrap, information, jack-knife, kernel, reliable, validity; and that’s just the tip of the iceberg. (Of course, one’s list gets bigger the more statistics courses one takes.)

It should come as no surprise that the general public mistakes a term’s statistical meaning for its general English meaning when nearly every word has some sort of dual meaning.

Philip Tromovitch (2015) has recently put out a neat paper in which he surveyed a little over 1,000 members of the general public on their understanding of the meaning of “significant,” a term which has a very precise statistical definition: assuming the null hypothesis is true (usually defined as no effect), discrepancies as large or larger than this result would be so rare that we should *act* as if the null hypothesis isn’t true and we won’t often be wrong.
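To make that definition concrete, here is a quick simulation of the logic (a made-up coin-flip example of my own, nothing from Tromovitch’s paper):

```python
import random

random.seed(1)

# Hypothetical data: a coin flipped 100 times comes up heads 61 times.
# Under the null hypothesis (fair coin), how often would we see a
# departure from 50 heads at least that large? That tail proportion
# is the p-value.
n_flips, observed_heads = 100, 61
n_sims = 20_000

extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed_heads - 50):  # as or more extreme
        extreme += 1

p_value = extreme / n_sims
print(f"approximate two-sided p-value: {p_value:.3f}")  # roughly 0.035
```

Since that comes out below .05, we would *act* as if the null hypothesis isn’t true, knowing that decisions made this way are wrong at a controlled rate in the long run.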

However, in everyday English, something that is significant is noteworthy or worth our attention. Rather than give a clichéd dictionary definition, I asked my mother what she thought. She said she would interpret a phrase such as “there was a significant drop in sales from 2013 to 2014” to indicate that the drop in sales was “pretty big, like quite important.” (thanks mom 🙂 ) But that’s only one person. What did Tromovitch’s survey respondents think?

Tromovitch surveyed a total of 1103 people. He asked 611 of his respondents to answer a multiple-choice question, and the rest answered a variant as an open-ended question. Here is the multiple-choice question:

When scientists declare that the finding in their study is “significant,” which of the following do you suspect is closest to what they are saying:

- the finding is large
- the finding is important
- the finding is different than would be expected by chance
- the finding was unexpected
- the finding is highly accurate
- the finding is based on a large sample of data

Respondents choosing the first two responses were considered to be *incorrectly using general English*, choosing the third answer was considered *correct*, and choosing any of the final three was considered an *other incorrect answer*. He separated general public responses from those with doctorate degrees (n = 15), but he didn’t get any information on what topic their degrees were in, so I’ll just refer to the rest of the sample’s results from here on; the doctorate sample should really be taken with a grain of salt.

Roughly 50% of respondents gave a general english interpretation of the “significant” results (options 1 or 2), roughly 40% chose one of the other three wrong responses (options 4, 5, or 6), and less than 10% actually chose the correct answer (option 3). Even if they were totally guessing you’d expect them to get close to 17% correct (1/6), give or take.
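A quick back-of-the-envelope check on that guessing baseline (using the reported n = 611; the “less than 10%” figure is approximate):

```python
# If all 611 multiple-choice respondents guessed uniformly among the
# six options, correct answers would follow a binomial distribution.
n, p = 611, 1 / 6
expected = n * p                  # about 102 correct expected by chance
sd = (n * p * (1 - p)) ** 0.5    # about 9, the binomial standard deviation

# "Less than 10% correct" means fewer than about 61 correct answers,
# several standard deviations below the guessing baseline, so the
# sample is doing noticeably worse than chance.
z = (0.10 * n - expected) / sd
print(f"guessing predicts {expected:.0f} +/- {sd:.0f} correct; "
      f"observed is below {0.10 * n:.0f} (z < {z:.1f})")
```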

But perhaps multiple choice format isn’t the best way to get at this, since the prompt itself provides many answers that sound perfectly reasonable. Tromovitch also asked this as an open-ended question to see what kind of responses people would generate themselves. One variant of the prompt explicitly mentions that he wants to know about *statistical* significance, while the other simply mentions significance. The exact wording was this:

Scientists sometimes conclude that the finding in their study is “[statistically] significant.” If you were updating a dictionary of modern American English, how would you define the term “[statistically] significant”?

Did respondents do any better when they could answer freely? Not at all. Neither prompt had a high success rate: correct response rates were roughly 4% and 1%, respectively. That translates to **literally 12 correct answers** out of the 492 total respondents across both prompts (including the PhD responses). Tromovitch includes all of these responses in the appendix, so you can read the kinds of answers that were given and considered correct.

If you take a look at the responses, you’ll see that most of them imply some statement about the probability of one hypothesis or the other being true, which isn’t allowed by the correct definition of statistical significance! For example, one answer coded as correct, “The likelihood that the result/findings are not due to chance and probably true,” is blatantly incorrect. The probability that the results are not due to chance is not what statistical significance tells you at all. Most of the responses coded as “correct” by Tromovitch are quite vague, so it’s not clear that even those correct responders have a good handle on the concept. No wonder the general public looks at statistics as if they’re some hand-wavy magic. They don’t get it at all.

My takeaway from this study is the title of this piece: **the general public has no idea what statistical significance means**. That’s not surprising when you consider that **researchers themselves often don’t know what it means!** Even professors who teach research methods and statistics get this wrong. Results from Haller & Krauss (2002), building off of Oakes (1986), suggest that it is normal for students, academic researchers, and even methodology instructors to endorse incorrect interpretations of p-values and significance tests. That’s pretty bad. It’s one thing for first-year students or the lay public to be confused, but educated academics and methodology instructors too? If you don’t buy the survey results, open up any journal issue in any psychology journal and you’ll find tons of examples of misinterpretation and confusion.

Recently Hoekstra, Morey, Rouder, & Wagenmakers (2014) demonstrated that confidence intervals are similarly misinterpreted by researchers, despite recent calls (Cumming, 2014) to abandon significance tests entirely in favor of confidence intervals. Perhaps we could toss out the whole lot and start over with something that actually makes sense? Maybe we could try teaching something that people can actually understand?

I’ve heard of this cool thing called Bayesian statistics we could try.

#### References

Cumming, G. (2014). The new statistics: Why and how. *Psychological Science*, *25*(1), 7-29.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. *Methods of Psychological Research*, *7*(1), 1-20.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. *Psychonomic Bulletin & Review*, *21*(5), 1157-1164.

Oakes, M. W. (1986). *Statistical inference: A commentary for the social and behavioural sciences*. Wiley.

Tromovitch, P. (2015). The lay public’s misinterpretation of the meaning of ‘significant’: A call for simple yet significant changes in scientific reporting. *Journal of Research Practice*, *11*(1), Article P1.

One could argue that option 3 is incorrect as well. For what is meant by “the finding is different than would be expected by chance”? One would have to understand that this “difference” is measured by the integral over more extreme events that did not occur. In the words of Jeffreys (1980, p. 453), “I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened.”

EJ Wagenmakers

Yep, that option is not exactly correct either, because findings that are significant are expected to occur by chance; it’s just that they’re not expected to happen too often.
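To illustrate that point with a quick simulation (simple z-tests on made-up normal data, with a known standard deviation for simplicity): when the null is exactly true, “significant” results still show up, at about the alpha = .05 rate.

```python
import random

random.seed(2)

# Simulate many experiments in which the null hypothesis is exactly true
# (population mean 0, known sd 1), and test each at alpha = .05.
n_experiments, n_obs = 10_000, 30
significant = 0
for _ in range(n_experiments):
    xs = [random.gauss(0, 1) for _ in range(n_obs)]
    z = (sum(xs) / n_obs) / (1 / n_obs ** 0.5)  # z-test with known sd
    if abs(z) > 1.96:
        significant += 1

# By construction, about 5% of these true-null experiments come out
# "significant"; chance alone produces them, just not too often.
print(significant / n_experiments)
```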

Also, none of the open-ended responses mention data more extreme than those observed. Well, maybe the one that said “95 percentile,” if you feel generous, but that’s still really vague.

I agree that neither the public nor most researchers have a clear idea of what P values mean.

I don’t agree that it is either desirable, or even possible, to persuade experimentalists to use arguments that depend on subjective probabilities or arbitrary prior distributions.

Statisticians have failed to warn people that P values don’t measure what many people think they do. Their logic is impeccable, but the question that they answer is the wrong one. Statistics courses mostly lack any discussion of the false positive rate.

Fortunately it is possible to put a rough minimum on the false positive rate, and that necessitates a change in the wording that’s used to describe P values. http://rsos.royalsocietypublishing.org/content/1/3/140216

David, thanks for sharing your paper here for others to read. I disagree with you that FDR is what all researchers want to know, and I do not think that you can even begin to prescribe what all researchers want to get out of their experiments. For some situations, the false-discovery rate will indeed be what they want and they can feel free to calculate it. Often, in psychology at least, they don’t want to know anything like that and so FDR is not any use to them.
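For what it’s worth, the calculation itself is simple. A minimal sketch with illustrative numbers (these particular values are common textbook choices, not taken from David’s paper):

```python
def false_positive_rate(prior, power, alpha):
    """Fraction of 'significant' results that are false positives,
    given the prior fraction of tested hypotheses that are true."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return false_positives / (true_positives + false_positives)

# Illustrative numbers: if only 10% of tested effects are real, with
# 80% power and alpha = .05, then over a third of significant findings
# are false positives, even though alpha is 5%.
fpr = false_positive_rate(prior=0.10, power=0.80, alpha=0.05)
print(f"{fpr:.0%}")  # 36%
```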

They often want to know if the results of their study can be interpreted as evidence for their working hypothesis. That’s what I want in my experiments, and often that’s what I see people trying to get at when I read their experimental papers. To that end, competing relevant hypotheses can be instantiated as models (i.e., as prior distributions) and then the data can be interpreted as support for one or more model vs the others. The process is only as subjective as the theories that researchers decide to test.
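As a toy example of what I mean by instantiating hypotheses as models (my own made-up coin-flip illustration, not anything specific to psychology):

```python
from math import comb

def bayes_factor_10(k, n):
    """Bayes factor comparing H1 (coin bias unknown, uniform prior on p)
    against H0 (fair coin, p = 0.5), given k heads in n flips."""
    # Marginal likelihood under H1: the binomial likelihood integrated
    # over a uniform prior on p works out to exactly 1 / (n + 1).
    marginal_h1 = 1 / (n + 1)
    # Likelihood of the data under the point null H0.
    marginal_h0 = comb(n, k) * 0.5 ** n
    return marginal_h1 / marginal_h0

# 61 heads in 100 flips: the data favor the biased-coin model only by
# a factor of about 1.4, i.e. nearly equivocal evidence.
print(round(bayes_factor_10(61, 100), 2))
```

The “subjectivity” here is just the choice of the uniform prior to represent “bias unknown”; a different theory about the coin would be represented by a different prior.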

I’ve already told you what the objective/subjective Bayes categorizations mean on twitter, and I don’t have the patience to go on about it with you at length again. “Subjective Bayes” simply means that models are informed by the specific theory they are meant to represent, rather than “Objective Bayes”, which uses generic theory-free models that have certain desired operating characteristics. Nowhere do personal whims fit into this.

And I imagine I won’t ever change your mind, so I’m happy to say I disagree with you and be on with it.

I have spent most of my life as an experimenter, at the hard end of the spectrum (single-molecule biophysics). We have physical models for what’s happening, and we estimate physical quantities (rate constants, equilibrium constants). I don’t begin to understand what “models” mean in an empirical area like psychology. But judging by their record of irreproducibility, they are doing something wrong.

I have to say that language like “I’ve already told you what the objective/subjective Bayes categorizations mean on twitter” seems distinctly unhelpful. It would make me feel very stupid if it were not for the fact that I think most people’s views are closer to mine than to yours.

If the solution were as obvious as that rather arrogant statement suggests, there would be little to discuss.

David, on twitter last month in our discussion of this, I shared a passage with you written by Bayesian statisticians that explicitly defined what Objective Bayes was. It was crystal clear. You replied, I quote, “we’ll have to agree to disagree then” and “I simply don’t believe that”.

If you will not believe the information given to you that has been taken from the horse’s mouth then there is no point in me trying to convince you.

As I hope I made clear, my point was simply that experimenters generally won’t tolerate subjective probabilities and guessed priors. The whole idea is to be as objective as possible, and that is why Bayesian arguments will never be accepted by natural scientists.

We don’t seem to have any common ground here, and so as I said, “I’m happy to say I disagree with you and be on with it.”