No single index should substitute for scientific reasoning.

— Official ASA statement

*TLDR: The American Statistical Association’s official stance is that p-values are poor measures of evidence. We as psychologists need to recalibrate our intuitions about what constitutes good evidence. See the full statement here. [Link fixed!]*

The American Statistical Association just released its long-promised official statement regarding its stance on p-values. If you don’t remember (don’t worry, it was over a year ago), the ASA responded to Basic and Applied Social Psychology’s (BASP) widely publicized p-value ban by saying,

A group of more than two-dozen distinguished statistical professionals is developing an ASA statement on p-values and inference that highlights the issues and competing viewpoints. The ASA encourages the editors of this journal [BASP] and others who might share their concerns to consider what is offered in the ASA statement to appear later this year and not discard the proper and appropriate use of statistical inference.

This development is especially relevant for psychologists, since the p-value is ubiquitous in our literature. I think I have only ever seen a handful of papers without one. Are we using it correctly? What is proper? The ASA is here to set us straight.

### The scope of the statement

The statement begins by saying “While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.” To help clarify how the p-value *should* be used, the ASA “believes that the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value.” Their stated goal is to articulate “in non-technical terms a few select principles that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.”

### So first things first: what is a p-value?

The ASA gives the following definition for a p-value:

a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

So the p-value is a probability statement about the observed data (and data more extreme than those observed), given that an underlying statistical model (e.g., a null hypothesis) is true. How can we use this probability measure?
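This definition can be made concrete with a minimal permutation-test sketch. The numbers below are entirely made up for illustration; the point is that the p-value is just the fraction of datasets, simulated under the specified model (“group labels don’t matter”), whose summary statistic is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# two hypothetical groups (made-up data, purely for illustration)
a = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 4.7])
b = np.array([4.2, 4.9, 4.4, 5.0, 4.1, 4.6])
observed = a.mean() - b.mean()  # the "statistical summary of the data"

# Under the null model the labels are exchangeable, so reshuffle them
# and ask: how often is the mean difference as or more extreme
# than the one we observed?
pooled = np.concatenate([a, b])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:6].mean() - perm[6:].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(p_value)
```

Note that nothing in this computation says anything about the probability that the null model is true; it only quantifies how surprising the observed summary would be if it were.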

### Six principles for using p-values

The basic gist of the statement is this: p-values *can* be used as a measure of the misfit between the data and a model (e.g., a null hypothesis), but that measure of misfit *does not* tell us the probability that the null hypothesis is true (as we all hopefully know by now). It *does not* tell us what action we should take — submit to a big-name journal, abandon/continue a research line, implement an intervention, etc. It *does not* tell us how big or important the effect we’re studying is. And most importantly (in my opinion), it *does not* give us a meaningful measure of evidence regarding a model or hypothesis.

Here are the principles:

1. *P-values can indicate how incompatible the data are with a specified statistical model.*
2. *P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.*
3. *Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.*
4. *Proper inference requires full reporting and transparency.*
5. *A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.*
6. *By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.*

In the paper each principle is followed by a paragraph of detailed exposition. I recommend you take a look at the full statement.

### So what does this mean for psychologists?

The ASA gives many explicit recommendations, and it is worth reading their full (short!) report. I think the most important is principle 6. Psychologists mainly use p-values as a measure of the evidence we have obtained against the null hypothesis: you run your study and check the p-value; if p is below .05, you have “significant” evidence against the null hypothesis, and you then feel justified in doubting it and, consequently, in having confidence in your preferred substantive hypothesis.

The ASA tells us this is not good practice. Taking a p-value as strong evidence just because it is below .05 is actually misleading; the ASA specifically says “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.” I recently discussed a paper on this blog (Berger & Delampady, 1987 [pdf]) that showed exactly this: A p-value near .05 can only achieve a *maximum* Bayes factor of ~2 with most acceptable priors, which is a very weak level of evidence — and usually it is much weaker still.
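The “maximum Bayes factor of ~2” is easy to verify numerically. A quick sketch using the closely related Sellke–Bayarri–Berger calibration (not the Berger & Delampady derivation itself, but the same flavor of upper bound, valid for p < 1/e):

```python
import math

def bf_bound(p):
    """Upper bound on the Bayes factor against the null implied by a
    p-value, per the Sellke-Bayarri-Berger -1/(e * p * ln p) calibration.
    Valid for p < 1/e."""
    return 1.0 / (-math.e * p * math.log(p))

print(round(bf_bound(0.05), 2))  # ≈ 2.46: at best, weak evidence
print(round(bf_bound(0.01), 2))  # ≈ 7.99: still fairly modest
```

So even the most favorable prior cannot turn p = .05 into more than roughly 2.5-to-1 evidence against the null, and actual Bayes factors are typically much smaller than this bound.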

The bottom line is this: We need to adjust our intuitions about what constitutes adequate evidence. Joachim Vandekerckhove and I recently concluded that one big reason effects “failed to replicate” in the Reproducibility Project: Psychology is that the evidence for their existence was unacceptably weak to begin with. When we properly evaluate the evidence from the original studies (even before taking publication bias into account) we see there was little reason to believe the effects ever existed in the first place. “Failed” replications are a natural consequence of our current low standards of evidence.

There are many (many, many) papers in the statistics literature showing that p-values overstate the evidence against the null hypothesis; now the ASA has officially taken this stance as well.

### Choice quotes

Below I include some quotations I think are most relevant to practicing psychologists.

Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible.

In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.

The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.

Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted and all p-values computed. Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting.

Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect.

Readers should keep in mind that a few people wrote this and not everyone agrees with all of it. The worst thing that could come of this is if it were taken as gospel. Who says the Bayes factor is the bedrock for measuring whether p-values overstate evidence? Answer: people who think Bayes factors are the standard for measuring evidence. Others think they are ratios with no firm meaning (2, 50, 100, 500?) and aren’t at all comparable in different contexts. And unless you take the BF as a statistic and assess its error probabilities, there’s no error control (except possibly for predesignated point against point hypotheses).

Anyway, here are my comments.

http://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/


Alexander, glad to see this ASA position. How are other disciplines using p-values? Well, I am an economist, and (to my knowledge) the situation in economic research is certainly not better than in psychology.

Look at this paper: http://www.dnb.nl/binaries/Working%20paper%20499_tcm46-337173.pdf

The Dutch Central Bank investigated whether clean counterfeit banknotes are easier to spot than dirty counterfeit banknotes.

They did experimental research. The design: invite “average citizens” and cashiers to a laboratory, and ask them to pick counterfeits out of stacks of clean and dirty banknotes. My questions:

– why such a setting? Why not have researchers pay in real-life situations, using 4 types of banknotes: genuine and counterfeit, both clean and dirty?

– why the distinction between citizens and cashiers? Yes, I can make up a story behind it, but it seems arbitrary.

– the method and the results are so briefly described that I end up with more questions than answers. Raw data are not given.

Next, they do a regression analysis (huh?) using many independent variables:

– age of the respondent

– cashier yes or no

– nationality

– number of counterfeits in the stack

– sex

– education level

– visual handicap yes/ no

– preference for cash payments

– preference for credit card payments

– has checked banknotes in the past 6 months

And guess what? Some of these variables turn out to be significant at the 0.05 or even 0.01 level!

It is not rocket science to achieve low p-values. There are countless variables that may influence the detection of counterfeits. Let me suggest a few:

– stress (is the respondent under stress)

– ovulation (why not use this classic one?)

– handicaps & diseases (wild guess: Parkinson’s disease patients with shaking hands are not good at detecting counterfeits)

– intoxication levels of the respondents

– the sums we are talking about (€10,000 in cash will be checked more carefully than a single €5 note)

– general level of criminality in the city/ area

– paranoia level of respondent

– geopolitical situation (hypothesis: in times of political tension, people check their banknotes more carefully)

– distractions at the time of payment

– etc

– etc

Looks like a typical Garden of Forking Paths (see Gelman). This, plus unclear stopping rules in the research design, means that those p-values have very little meaning… This is how statistics is done in the land of economists.
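A toy simulation (entirely made-up data, nothing from the banknote study) shows how easily pure noise yields “significant” predictors when many are tried:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_predictors, n_obs = 2000, 10, 100

hits = 0
for _ in range(n_experiments):
    # outcome and predictors are all pure noise: every null is true
    y = rng.standard_normal(n_obs)
    X = rng.standard_normal((n_obs, n_predictors))
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)]
    if min(pvals) < 0.05:
        hits += 1  # at least one "significant" predictor found

rate = hits / n_experiments
print(rate)  # theoretically about 1 - 0.95**10 ≈ 0.40
```

With just ten irrelevant predictors, roughly 40% of these null experiments produce at least one p < .05, and that is before any forking-path flexibility in the analysis itself.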

Thanks so much, Pieter, for this candid assessment of research in econ. I find it always interesting and fruitful to learn how other disciplines approach the same statistical issues. And from what you write, econ and psych share a lot of the same shortcomings (I am a psychologist by training, BTW).

There is a paper by Simonsohn on SSRN, titled, I think, “False-Positive Economics,” where he discusses some of these issues.

The other day I was chatting with a theoretical physicist, and physics went through something similar to what psychology is going through right now.

Particle physicists used to declare new particles when the test statistic exceeded 3 sigma (a p-value of roughly 0.001), but people realized that they had a lot of false-positive discoveries. The reason was multiple testing (fishing, really); the physicist called this the “look-elsewhere effect”.

Their solution: ramp up the threshold to 5 sigma, and that’s what they have been sticking to ever since. You can actually see all of this nicely in the recent paper that was published on the discovery of gravitational waves.
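For reference, those sigma thresholds translate to one-sided p-values like so (a small sketch using only the standard library):

```python
import math

def one_sided_p(sigma):
    # upper-tail probability of a standard normal at `sigma`
    return 0.5 * math.erfc(sigma / math.sqrt(2))

print(one_sided_p(3))  # ≈ 1.35e-3, the old "evidence" threshold
print(one_sided_p(5))  # ≈ 2.87e-7, the conventional discovery threshold
```

In other words, the discovery threshold in particle physics is several orders of magnitude stricter than the .05 convention in psychology, precisely because of the look-elsewhere problem.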

> Next, they do a regression analysis (huh?) using many independent variables:

Yes, I recently worked on replicating an economic study that did much the same thing. The study is by economist Madeline Zavodny, and it uses a p-value of 0.004 as evidence that “an additional 100 foreign-born workers in STEM fields with advanced degrees from US universities is associated with an additional 262 jobs among US natives”. The years for which Zavodny calculated this result were 2000 to 2007, and I was able to replicate this for the same years, getting a result of 263. This can be seen in the first row of Table 10 at http://econdataus.com/amerjobs.htm . In fact, if a truncation error is removed, the result becomes 293.4, shown in Table 11. However, Table 11 also shows that, if you move the time span forward 2 years to 2002-2009, the 293.4 gain becomes a 121.1 LOSS. Further, it appears that all but 4 of the 28 time spans of 3 or more years from 2002 to 2011 show a loss. Someone challenged me to look at the p-values and see if perhaps the result for 2000-2007 was much more significant than these results. In fact, all 66 of the time spans in the table are highly significant!

Apparently she tortured the numbers and they confessed 🙂 BTW, nice replication research.

I like that expression! Yes, I believe that numbers were “hurt” in the writing of that study! In fact, an article at http://www.nationalreview.com/article/426989/myth-h-1b-job-creation-michelle-malkin states:

> Zavodny’s study initially examined data from the years 2000 to 2010. She hypothesized that states with more foreign-born workers would have higher rates of employment among native-born Americans. Initially, she was unable to find a significant effect of foreign-born workers on U.S. jobs.

> So what changed? In correspondence with me and John Miano (the co-author of our new book, Sold Out, on the foreign-guest-worker racket), Zavodny revealed that when she showed her initial results to the study sponsor, the backers came up with the idea of discarding the last three years of data — ostensibly to eliminate the effects of the economic recession — and trying again.

> Voilà! After recrunching the numbers at the sponsor’s request, Zavodny found the effect the study sponsor was hoping to find.

Strangely, they didn’t seem to think of dropping 2000-2002 to eliminate the effects of the tech crash. Table 11 at http://econdataus.com/amerjobs.htm suggests that these years were key to the job gain result. That’s not surprising as the employment of both foreign and native workers declined sharply, showing a positive correlation between the two.

Reblogged this on POLITICAL PSYCHOLOGY and commented:

“The American Statistical Association gave an official statement about correct and appropriate uses of p values.

In short: We need to reevaluate our intuitions about what is an adequate level of statistical evidence.”

Alexander J. Etz