Question: Why do we settle for 80% power? Answer: We’re confused.

Coming back to the topic of my previous post, about how we must draw distinct conclusions from different hypothesis test procedures, I’d like to show an example of how these confusions might actually arise in practice. The following example comes from Royall’s book (you really should read it), and questions why we settle for a power of only 80%. It’s a question we’ve probably all thought about at some point. Isn’t 80% power just as arbitrary as p-value thresholds? And why should we settle for such a large probability of error before we even start an experiment?

From Royall (1997, pp. 109-110):

Why is a power of only 0.80 OK?

We begin with a mild peculiarity — why is it that the Type I error rate α is ordinarily required to be 0.05 or 0.01, but a Type II error rate as large as 0.20 is regularly adopted? This often occurs when the sample size for a clinical trial is being determined. In trials that compare a new treatment to an old one, the ‘null’ hypothesis usually states that the new treatment is not better than the old, while the alternative states that it is. The specific alternative value chosen might be suggested by pilot studies or uncontrolled trials that preceded the experiment that is now being planned, and the sample size is determined [by calculating power] with α = 0.05 and β = 0.20. Why is such a large value of β acceptable? Why the severe asymmetry in favor of α? Sometimes, of course, a Type I error would be much more costly than a Type II error would be (e.g. if the new treatment is much more expensive, or if it entails greater discomfort). But sometimes the opposite is true, and we never see studies proposed with α = 0.20 and β = 0.05. No one is satisfied to report that ‘the new treatment is statistically significantly better than the old (p ≤ 0.20)’.

Often the sample-size calculation is first made with β = α = 0.05. But in that case experimenters are usually quite disappointed to see what large values of n are required, especially in trials with binomial (success/failure) outcomes. They next set their sights a bit lower, with α = 0.05 and β = 0.10, and find that n is still ‘too large’. Finally they settle for α = 0.05 and β = 0.20.

Why do they not adjust α and settle for α = 0.20 and β = 0.05? Why is small α a non-negotiable demand, while small β is only a flexible desideratum? A large α would seem to be scientifically unacceptable, indicating a lack of rigor, while a large β is merely undesirable, an unfortunate but sometimes unavoidable consequence of the fact that observations are expensive or that subjects eligible for the trial are hard to find and recruit. We might have to live with a large β, but good science seems to demand that α be small.

What is happening is that the formal Neyman-Pearson machinery is being used, but it is being given a rejection-trial interpretation (Emphasis added). The quantities α and β are not just the respective probabilities of choosing one hypothesis when the other is true; if they were, then calling the first hypothesis H2 and the second H1 would reverse the roles of α and β, and α = 0.20, β = 0.05 would be just as satisfactory for the problem in its new formulation as α = 0.05 and β = 0.20 were in the old one. The asymmetry arises because the quantity α is being used in the dual roles that it plays in rejection trials: it is both the probability of rejecting a hypothesis when that hypothesis is true and the measure of strength of the evidence needed to justify rejection. Good science demands small α because small α is supposed to mean strong evidence. On the other hand, the Type II error probability β is being interpreted simply as the probability of failing to find strong evidence against H1 when the alternative H2 is true (Emphasis added. Recall Fisher’s quote about the impossibility of making Type II errors since we never accept the null.) … When observations are expensive or difficult to obtain we might indeed have to live with a large probability of failure to find strong evidence. In fact, when the expense or difficulty is extreme, we often decide not to do the experiment at all, thereby accepting values of α = 0 and β = [1].

— End excerpt.

So there we have our confusion, which I alluded to in the previous post. We are imposing rejection-trial reasoning onto the Neyman-Pearson decision framework. We accept a huge β because we interpret our results as a mere failure (to produce strong enough evidence) to reject the null, when really our results imply a decision to accept the ‘null’. Remember, with NP we are always forced to choose between two hypotheses — we can never abstain from this choice because the respective rejection regions for H1 and H2 encompass the entire sample space by definition; that is, any result obtained must fall into one of the rejection regions we’ve defined. We can adjust either α or β (before starting the experiment) as we see fit, based on the relative costs of these errors. Since neither hypothesis is inherently special, adjusting α is as justified as adjusting β and neither has any bearing on the strength of evidence from our experiment.

And surely it doesn’t matter which hypothesis is defined as the null, because then we would just switch the respective α and β — that is, H1 and H2 can be reversed without any penalty in the NP framework. Who cares which hypothesis gets the label 1 or 2?
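To see that symmetry in numbers, here is a minimal sketch of the sample-size calculation Royall describes. It assumes a one-sided, one-sample z-test with known standard deviation 1 and a standardized effect of 0.5; those choices, and the function name n_required, are illustrative assumptions rather than anything taken from Royall's book.

```python
from math import ceil
from statistics import NormalDist


def n_required(alpha, beta, delta):
    """Sample size for a one-sided z-test (sd = 1) to detect effect `delta`
    with Type I error rate `alpha` and Type II error rate `beta`."""
    z = NormalDist()  # standard normal
    return ceil(((z.inv_cdf(1 - alpha) + z.inv_cdf(1 - beta)) / delta) ** 2)


# The usual retreat: hold alpha at .05 and let beta grow until n looks affordable.
for alpha, beta in [(0.05, 0.05), (0.05, 0.10), (0.05, 0.20)]:
    print(f"alpha={alpha:.2f}, beta={beta:.2f}: n = {n_required(alpha, beta, 0.5)}")

# The swap nobody proposes: same n, because the formula is symmetric in alpha and beta.
print(f"alpha=0.20, beta=0.05: n = {n_required(0.20, 0.05, 0.5)}")
```

Under these assumptions, relaxing β from .05 to .20 buys exactly the same reduction in n as relaxing α from .05 to .20 would. The arithmetic has no preference between the two; the preference comes from the rejection-trial reading of α.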

But imagine the outrage (and snarky blog posts) if we tried swapping out the null hypothesis with our pet hypothesis in a rejection trial. Would anybody buy it if we tried to accept our pet hypothesis simply based on a failure to reject it? Of course not, because that would be absurd. Failing to find strong evidence against a single hypothesis has no logical implication that we have found evidence for that hypothesis. Fisher was right about this one. And this is yet another reason NP procedures and rejection trials don’t mix.

However, when we are using concepts of power and Type II errors, we are working with NP procedures which are completely symmetrical and have no concept of strength of evidence per se. Failure to reject the null hypothesis has the exact same meaning as accepting the null hypothesis — they are simply different ways to say the same thing.  If what you want is to measure evidence, fine; I think we should be measuring evidence in any case. But then you don’t have a relevant concept of power, as Fisher has reiterated time and time again. If you want to use power to help plan experiments (as seems to be recommended just about everywhere you look) then you must cast aside your intuitions about interpreting observations from that experiment as evidence. You must reject the rejection trial and reject notions of statistical evidence. 

Or don’t, but then you’re swimming in a sea of confusion.

 

References

Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC Press.

Are all significance tests made of the same stuff?

No! If you are like most of the sane researchers out there, you don’t spend your days and nights worrying about the nuances of different statistical concepts, especially ones as traditional as these. But there is one concept that I think we should all be aware of: P-values mean very different things to different people. Richard Royall (1997, pp. 76-77) provides a smattering of different possible interpretations and fleshes out the arguments for why these mixed interpretations are problematic (much of this post comes from his book):

In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we will say that the data on which the test is based do not provide sufficient evidence to cause rejection. (Daniel, 1991, p. 192)

A nonsignificant result does not prove that the null hypothesis is correct — merely that it is tenable — our data do not give adequate grounds for rejecting it. (Snedecor and Cochran, 1980, p. 66)

The verdict does not depend on how much more readily some other hypothesis would explain the data. We do not even start to take that question seriously until we have rejected the null hypothesis. … The statistical significance level is a statement about evidence… If it is small enough, say p = 0.001, we infer that the result is not readily explained as a chance outcome if the null hypothesis is true and we start to look for an alternative explanation with considerable assurance. (Murphy, 1985, p. 120)

If [the p-value] is small, we have two explanations — a rare event has happened, or the assumed distribution is wrong. This is the essence of the significance test argument. Not to reject the null hypothesis … means only that it is accepted for the moment on a provisional basis. (Watson, 1983)

Test of hypothesis. A procedure whereby the truth or falseness of the tested hypothesis is investigated by examining a value of the test statistic computed from a sample and then deciding to reject or accept the tested hypothesis according to whether the value falls into the critical region or acceptance region, respectively. (Remington and Schork, 1970, p. 200)

Although a ‘significant’ departure provides some degree of evidence against a null hypothesis, it is important to realize that a ‘nonsignificant’ departure does not provide positive evidence in favour of that hypothesis. The situation is rather that we have failed to find strong evidence against the null hypothesis. (Armitage and Berry, 1987, p. 96)

If that value [of the test statistic] is in the region of rejection, the decision is to reject H0; if that value is outside the region of rejection, the decision is that H0 cannot be rejected at the chosen level of significance … The reasoning behind this decision process is very simple. If the probability associated with the occurrence under the null hypothesis of a particular value in the sampling distribution is very small, we may explain the actual occurrence of that value in two ways; first we may explain it by deciding that the null hypothesis is false or, second, we may explain it by deciding that a rare and unlikely event has occurred. (Siegel and Castellan, 1988, Chapter 2)

These all mix and match three distinct viewpoints with regard to hypothesis tests: 1) Neyman-Pearson decision procedures, 2) Fisher’s p-value significance tests, and 3) Fisher’s rejection trials (I think 2 and 3 are sufficiently different to be considered separately). Mixing and matching them is inappropriate, as will be shown below. Unfortunately, they all use the same terms so this can get confusing! I’ll do my best to keep things simple.

1. Neyman-Pearson (NP) decision procedure:
Neyman describes it thusly:

The problem of testing a statistical hypothesis occurs when circumstances force us to make a choice between two courses of action: either take step A or take step B… (Neyman 1950, p. 258)

…any rule R prescribing that we take action A when the sample point … falls within a specified category of points, and that we take action B in all other cases, is a test of a statistical hypothesis. (Neyman 1950, p. 258)

The terms ‘accepting’ and ‘rejecting’ a statistical hypothesis are very convenient and well established. It is important, however, to keep their exact meaning in mind and to discard various additional implications which may be suggested by intuition. Thus, to accept a hypothesis H means only to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true. Also if the application … ‘rejects’ H, this means only that the rule prescribes action B and does not imply that we believe that H is false. (Neyman 1950, p. 259)

So what do we take from this? NP testing is about making a decision to choose H0 or H1, not about shedding light on the truth of any one hypothesis or another. We calculate a test statistic, see where it lies with regard to our predefined rejection regions, and make the corresponding decision. We can ensure that we are not often wrong by defining Type I and Type II error probabilities (α and β) to be used in our decision procedure. According to this framework, a good test is one that minimizes these long-run error probabilities. It is important to note that this procedure cannot tell us anything about the truth of hypotheses and does not provide us with a measure of evidence of any kind, only a decision to be made according to our criteria. This procedure is notably symmetric: we can choose either H0 or H1.

Test results would look like this:

“α and β were prespecified (based on the relevant costs associated with the different errors) for this situation at yadda yadda yadda. The test statistic (say, t = 2.5) falls inside the rejection region for H0, defined as t > 2.0, so we reject H0 and accept H1.” (Alternatively, you might see “p < α = x, so we reject H0.” The exact value of p is irrelevant; it is either inside or outside of the rejection region defined by α. Obtaining p = .04 is effectively equivalent to obtaining p = .001 for this procedure, as is obtaining a result very much larger than the critical t above.)
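A minimal sketch of that rule as a pure decision function may help. The critical value of 2.0 comes from the example above; the function name np_decision and the wording of the returned labels are hypothetical.

```python
def np_decision(t_stat, critical_value=2.0):
    """Neyman-Pearson rule: the statistic is either inside the rejection
    region for H0 (t > critical value) or it is not. Nothing else is reported."""
    if t_stat > critical_value:
        return "reject H0, accept H1"
    return "accept H0, reject H1"


# A statistic barely past the cutoff and one far past it yield the same output.
for t in (1.9, 2.1, 2.5, 10.0):
    print(f"t = {t}: {np_decision(t)}")
```

The function returns one of two actions and discards everything else about the statistic, which is the sense in which the procedure carries no measure of evidence.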

2. Fisher’s p-value significance tests 

Fisher’s first procedure is only ever concerned with one hypothesis: the null. This procedure is not concerned with making decisions (and when in science do we actually ever do that anyway?) but with measuring evidence against the hypothesis. We want to evaluate ‘the strength of evidence against the hypothesis’ (Fisher, 1958, p. 80) by evaluating how rare our particular result (or even bigger results) would be if there were really no effect in the study. Our objective here is to calculate a single number that Fisher called the level of significance, or the p-value. Smaller p is more evidence against the hypothesis than larger p. Increasing levels of significance* are often represented** by more asterisks*** in tables or graphs. More asterisks mean lower p-values, and presumably more evidence against the null.

What is the rationale behind this test? There are only two possible interpretations of our low p: either a rare event has occurred, or the underlying hypothesis is false. Fisher doesn’t think the former is reasonable, so we should assume the latter (Bakan, 1966).

Note that this procedure is directly trying to measure the truth value of a hypothesis. Lower p-values indicate more evidence against the hypothesis. This is based on the Law of Improbability, that is,

Law of Improbability: If hypothesis A implies that the probability that a random variable X takes on the value x is quite small, say p(x), then the observation X = x is evidence against A, and the smaller p(x), the stronger the evidence. (Royall, 1997, p. 65)

In a future post I will attempt to show why this law is not a valid indicator of evidence. For the purpose of this post we just need to understand the logic behind this test and that it is fundamentally different from NP procedures. This test alone does not provide any guidance with regard to taking action or making a decision, it is intended as a measure of evidence against a hypothesis.

Test results would look like this:

The present results obtain a t value of 2.5, which corresponds to an observed p = .01**. This level of significance is very small and indicates quite strong evidence against the hypothesis of no difference.
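For contrast with a decision rule, here is a minimal sketch of the p-value as a graded quantity. The example above does not state degrees of freedom, so this sketch assumes a standard normal (z) statistic and a two-sided test; both are assumptions, as is the function name two_sided_p.

```python
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic: the probability,
    under the null, of a result at least as extreme as the one observed."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Unlike a fixed cutoff, the reported number keeps shrinking as the statistic grows.
for z in (2.0, 2.5, 3.5):
    print(f"z = {z}: p = {two_sided_p(z):.4f}")
```

Under that normal approximation a statistic of 2.5 gives p of roughly .012, in line with the p = .01 quoted above, and in this procedure the number itself is the result.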

3. Fisher’s rejection trials

This is a strange twist on both of the other procedures above, taking elements from each to form a rejection trial. This test is a decision procedure, much like NP procedures, but with only one explicitly defined hypothesis, a la p-value significance tests. The test is most like what psychologists actually use today, framed as two possible decisions, again like NP, but now they are framed in terms of only one hypothesis. Rejection regions are back too, defined as a region of values that have small probability under H0 (i.e., defined by a small α). It is framed as a problem of logic, specifically,

…a process analogous to testing a proposition in formal logic via the argument known as modus tollens, or ‘denying the consequent’: if A implies B, then not-B implies not-A. We can test A by determining whether B is true. If B is false, then we conclude that A is false. But, on the other hand, if B is found to be true we cannot conclude that A is true. That is, A can be proven false by such a test but it cannot be proven true — either we disprove A or we fail to disprove it…. When B is found to be true, so that A survives the test, this result, although not proving A, does seem intuitively to be evidence supporting A. (Royall, 1997, p. 72)

An important caveat is that these tests are probabilistic in nature, so the logical implications aren’t quite right. Nevertheless, rejection trials are what Fisher referred to when he famously said,

Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis… The notion of an error of the so-called ‘second kind,’ due to accepting the null hypothesis ‘when it is false’ … has no meaning with reference to simple tests of significance. (Fisher, 1966)

So there is a major difference from NP: with rejection trials you have a single hypothesis (as opposed to two) combined with the decision rule “reject H0 or do not reject H0” (as opposed to reject H0/H1 or accept H0/H1). With rejection trials we are back to making a decision. This test is asymmetric (as opposed to NP, which is symmetric); that is, we can only ever reject H0, never accept it.

While we are making decisions with rejection trials, the decisions have a different meaning than that of NP procedures. In this framework, deciding to reject H0 implies the hypothesis is “inconsistent with the data” or that the data “provide sufficient evidence to cause rejection” of the hypothesis (Royall, 1997, p. 74). So rejection trials are intended to be both decision procedures and measures of evidence. Test statistics that fall into smaller α regions are considered stronger evidence, much the same way that a smaller p-value indicates more evidence against the hypothesis. For NP procedures α is simply a property of the test, and choosing a lower one has no evidential meaning per se (although see Mayo, 1996 for a 4th significance procedure: severity testing).

Test results would look like this:

The present results obtain a t = 2.5, p = .01, which is sufficiently strong evidence against H0 to warrant its rejection.

What is the takeaway?

If you aren’t aware of the difference between the three types of hypothesis testing procedures, you’ll find yourself jumbling them all up (Gigerenzer, 2004). If you aren’t careful, you may end up thinking you have a measure of evidence when you actually have a guide to action.

Which one is correct?

Funnily enough, I don’t endorse any of them. I contend that p-values never measure evidence (in either p-value procedures or rejection trials), and that NP procedures lead to absurdities that I can’t in good faith accept, so I can’t endorse them either.

Why write 2000 words clarifying the nuanced differences between three procedures I think are patently worthless? Well, did you see what I said at the top referring to sane researchers?

A future post is coming that will explicate the criticisms of each procedure, many of the points again coming from Royall’s book.

References

Armitage, P., & Berry, G. (1987). Statistical methods in medical research. Oxford: Blackwell Scientific.

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66(6), 423.

Daniel, W. W. (1991). Hypothesis testing. Biostatistics: a foundation for analysis in the health sciences, 5, 191.

Fisher, R. A. (1958). Statistical methods for research workers (13th ed.). New York: Hafner.

Fisher, R. A. (1966). The design of experiments (8th ed.). Oliver and Boyd.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Murphy, E. A. (1985). A companion to medical statistics. Johns Hopkins University Press.

Neyman, J. (1950). First course in probability and statistics. New York: Henry Holt.

Remington, R. D., & Schork, M. A. (1970). Statistics with applications to the biological and health sciences.

Royall, R. (1997). Statistical evidence: a likelihood paradigm (Vol. 71). CRC Press.

Siegel, S., & Castellan, N. J., Jr. (1988). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods. Ames: Iowa State University Press.

Watson, G. S. (1983). Hypothesis testing. Encyclopedia of Statistics in Quality and Reliability.

Should we buy what Greg Francis is selling? (Nope)

If you polled 100 scientists at your next conference with the single question, “Is there publication bias in your field?” I would predict that nearly 100% of respondents would reply “Yes.” How do they know? Did they need to read about a thorough investigation of many journals to come to that conclusion? No, they know because they have all experienced publication bias firsthand.

Until recently, researchers had scant opportunity to publish their experiments that didn’t “work” (and most times they still can’t, but now at least they can share them online unpublished). Anyone who has tried to publish a result in which all of their main findings were not “significant,” or who has had a reviewer ask them to collect more subjects in order to lower their p-value (a big no-no), or who has neglected to submit to a conference when the results were null, or who has seen colleagues tweak and re-run experiments that failed to reach significance only to stop when one does, knows publication bias exists. They know that if they don’t have a uniformly “positive” result then it won’t be taken seriously. The basic reality is this: If you do research in any serious capacity, you have experienced (and probably contributed to) publication bias in your field.

Greg Francis thinks that we should be able to point out certain research topics or journals (that we already know to be biased toward positive results) and confirm that they are biased, using the Test of Excess Significance, a method developed by Ioannidis and Trikalinos (2007). The logic of the test is that of a traditional null-hypothesis test, and I’ll quote from Francis’s latest paper published in PLOS One (Francis et al., 2014):

We start by supposing proper data collection and analysis for each experiment along with full reporting of all experimental outcomes related to the theoretical ideas. Such suppositions are similar to the null hypothesis in standard hypothesis testing. We then identify the magnitude of the reported effects and estimate the probability of success for experiments like those reported. Finally, we compute a joint success probability Ptes, across the full set of experiments, which estimates the probability that experiments like the ones reported would produce outcomes at least as successful as those actually reported. … The Ptes value plays a role similar to the P value in standard hypothesis testing, with a small Ptes suggesting that the starting suppositions are not entirely correct and that, instead, there appears to be a problems with data collection, analysis, or publication of relevant findings. In essence, if Ptes is small, then the published findings … appear “too good to be true” (pg. 3).
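As a rough sketch of the arithmetic in that description (not Francis’s actual code), the joint success probability can be computed once each experiment has an estimated probability of success. The sketch assumes those probabilities are simply given as inputs and that the experiments are independent; the function name p_tes and the example numbers are hypothetical.

```python
def p_tes(success_probs, n_observed_successes):
    """Probability that independent experiments with the given success
    probabilities (estimated power) yield at least the observed number
    of successes."""
    # dist[k] = probability of exactly k successes among experiments so far
    dist = [1.0]
    for p in success_probs:
        new = [0.0] * (len(dist) + 1)
        for k, prob_k in enumerate(dist):
            new[k] += prob_k * (1 - p)   # this experiment fails
            new[k + 1] += prob_k * p     # this experiment succeeds
        dist = new
    return sum(dist[n_observed_successes:])


# Five reported experiments, all significant, each with estimated power of .5:
print(p_tes([0.5] * 5, 5))  # 0.5 ** 5 = 0.03125, below the .10 cutoff Francis uses
```

When every reported experiment succeeded, the calculation reduces to the product of the estimated powers, which is why long runs of uniformly significant results with modest power produce a small Ptes.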

So it is a basic null-hypothesis significance test. I personally don’t see the point of this test since we already know with certainty that the answer to the question, “Is there publication bias in this topic?” is unequivocally “Yes.” So every case that the test finds not to be biased is a false negative. But as Daniel Lakens said, “anyone is free to try anything in science,” a sentiment with which I agree wholeheartedly. And I would be a real hypocrite if I thought Francis shouldn’t share his new favorite method even if it turns out it really doesn’t work very well. But if he is going to continue to apply this test and actually name authors who he thinks are engaging in specific questionable publishing practices, then he should at the very least include a “limitations of this method” section in every paper, wherein he at least cites his critics. He should also at least ask the original authors he is investigating for comments, since the original authors are the only ones who know the true state of their publication process. I am surprised that the reviewers and editor of this manuscript did not stop and ask themselves (or Francis), “It can’t be so cut and dried, can it?”

Why the Test for Excess Significance does not work

So on to the fun stuff. There are many reasons why this test cannot achieve its intended goals, and many reasons why we should take Francis’s claims with a grain of salt. This list is not arranged in order of importance, but in the order of his critics as listed in the Journal of Mathematical Psychology (JMP) special issue (excluding Ioannidis and Gelman because of space and relevance concerns). I selected the points that I think most clearly highlight the poor validity of this testing procedure. This list gets long, so you can skip to the Conclusions (tl;dr) below for a summary.

Vandekerckhove, Guan, Styrcula, 2013

  1. Using Monte Carlo simulations, Vandekerckhove and colleagues show that when used to censor studies that seem too good to be true in a 100% publication biased environment, the test censors almost nothing and the pooled effect size estimates remain as biased as before correction.
  2. Francis uses a conservative cutoff of .10 when he declares that a set of studies suffers from systematic bias. Vandekerckhove and colleagues simulate how estimates of pooled effect size change if we make the test more conservative by using a cutoff of .80. This has the counter-intuitive effect of increasing the bias in the pooled effect size estimate. In the words of the authors, “Perversely, censoring all but the most consistent-seeming papers … causes greater bias in the effect size estimate” (Italics original).
  3. Bottom line: This test cannot be used to adjust pooled effect size estimates by accounting for publication bias.

Simonsohn, 2013

  1. Francis acknowledges that there can be times when the test returns a significant result when publication bias is small. Indeed, there is no way to distinguish between different amounts of publication bias by comparing different Ptes values (remember the rules of comparing p-values). Francis nevertheless argues that we should assume any significant Ptes result to indicate an important level of publication bias. Repeat after me: Statistically significant ≠ practically significant. The fact of the matter is, “the mere presence of publication bias does not imply it is consequential” and by extension “does not warrant fully ignoring the underlying data” (Italics original). Francis continues to ignore these facts. [As an aside: if he can come up with a way to quantify the amount of bias in an article (and not just state that bias is present), then maybe the method could be taken seriously.]
  2. Francis’s critiques themselves suffer from publication bias, invalidating the reported Ptes-values. While Francis believes this is not relevant because he is critiquing unrelated studies, they are related enough to be written up and published together. While the original topics may indeed be unrelated, “The critiques by Francis, by contrast, are by the same author, published in the same year, conducting the same statistical test, to examine the exact same question.” Hardly unrelated, it would seem.
  3. If Francis can claim that his reported p-values are accurate because the underlying studies are unrelated, then so too can the original authors. Most reports with multiple studies test effects under different conditions or with different moderators. It goes both ways.

Johnson, 2013 (pdf hosted with permission of the author)

  1. Johnson begins by expressing how he feels being asked to comment on this method: “It is almost as if all parties involved are pretending that p-values reported in the psychological literature have some well-defined meaning and that our goal is to ferret out the few anomalies that have somehow misrepresented a type I error. Nothing, of course, could be farther from the truth.” The “truth is this: as normally reported, p-values and significance tests provide the consumer of these statistics absolutely no protection against rejecting “true” null hypotheses at less than any specified rate smaller than 1.0. P-values … only provide the experimenter with such a protection … if she behaves in a scientifically principled way” (Italics added). So Johnson rejects the premise that the test of excess significance is evaluating a meaningful question at all.
  2. This test uses a nominal alpha of .10, quite conservative for most classic statistical tests. Francis’s simulations show, however, that (when assumptions are met and under ideal conditions) the actual type I error rate is far, far lower than the nominal level. This introduces questions of interpretability: How do we interpret the alpha level under different (non-ideal) conditions if the nominal alpha is not informative? Could we adjust it to reflect its actual alpha level? Probably not.
  3. This test is not straightforward to implement, and one must be knowledgeable about the research question in the paper being investigated and which statistics are relevant to that question. Francis’s application to the Topolinski and Sparenberg (2012) article, for example, is fraught with possible researcher degrees of freedom regarding which test statistics he includes in his analysis.
  4. If researchers report multiple statistical tests based on the same underlying data, the assumption of independence is violated to an unknown degree, and the reported Ptes-values could range from barely altered at best, to completely invalidated at worst. Heterogeneity of statistical power for tests that are independent also invalidates the resulting Ptes-values, and his method has no way to account for power heterogeneity.
  5. There is no way to evaluate his sampling process, which is vital in evaluating any p-value (including Ptes). How did he come to analyze this paper, or this journal, or this research topic? How many did he look at before he decided to look at this particular one? Without this knowledge we cannot assess the validity of his reported Ptes-values.

Morey, 2013

  1. Bias is a property of a process, not any individual sample. To see this, Morey asks us to imagine that we ask people to generate “random” sequences of 0s and 1s. We know that humans are biased when they do this, and typically alternate 0 and 1 too often. Say we have the sequence 011101000. This shows 4 alternations, exactly as many as we would expect from a random process (50%, or 4/8). If we know a human generated this sequence, then regardless of the fact that it conforms perfectly to a random sequence, it is still biased. Humans are biased regardless of the sequence they produce. Publication processes are biased regardless of the bias level in studies they produce. Asking which journals or papers or topics show bias is asking the wrong question. We should ask if the publication process is biased, the answer to which we already know is “Yes.” We should focus on changing the process, not singling out papers/topics/journals that we already know come from a biased process.
  2. The test assumes a fixed sample size (as does almost every p-value), but most researchers run studies sequentially. Most sets of studies are a result of getting a particular result, tweaking the protocol, getting another result, and repeating until satisfied or out of money/time. We know that p-values are not valid when the sample size is not fixed in advance, and this holds for Francis’s Ptes all the same. It is probably not possible to adjust the test to account for the sequential nature of real-world studies, although I would be interested to see a proof.
  3. The test equates violations of the binomial assumption with the presence of publication bias, which is just silly. Imagine we use the test in a scenario like above (sequential testing) where we know the assumption is violated but we know that all relevant experiments for this paper are published (say, we are the authors). We could reject the (irrelevant) null hypothesis when we can be sure that the study suffers from no publication bias. Further, through simulation Morey shows that when true power is .4 or less, “examining experiment sets of 5 or greater will always lead to a significant result [Ptes-value], even when there is no publication bias” (Italics original).
  4. Ptes suffers from all of the limitations of p-values, chief of which are that different p-values are not comparable and p is not an effect size (or a measure of evidence at all). Any criticisms of p-values and their interpretation (of which there are too many to list) apply to Ptes.
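
To make the sequence example in Morey’s first point concrete, here is a minimal Python sketch that counts the alternations. The function name `count_alternations` is my own, not something from Morey’s paper.

```python
def count_alternations(bits: str) -> int:
    """Count positions where a symbol differs from the one before it."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


sequence = "011101000"
alternations = count_alternations(sequence)
transitions = len(sequence) - 1  # 9 symbols give 8 adjacent pairs

print(alternations, transitions, alternations / transitions)  # 4 8 0.5
```

The sequence looks perfectly “random” on this measure, which is exactly the point: the output alone cannot tell you whether the process that produced it is biased.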

Conclusions (tl;dr)

The test of excess significance suffers from many problems, ranging from answering the wrong questions about bias, to untenable assumptions, to poor performance in correcting effect size estimates for bias, to challenges of interpreting significant Ptes-values. Francis published a rejoinder in which he tries to address these concerns, but I find his rebuttal lacking. Because of space constraints (this is super long already) I won’t list the points in his reply, but I encourage you to read it if you are interested in this method. He disagrees with pretty much every point I’ve listed above, and often claims the commentaries are addressing the wrong questions. I contend that he falls into the same trap he warns others to avoid in his rejoinder, that is, “[the significance test can be] inappropriate because the data do not follow the assumptions of the analysis. … As many statisticians have emphasized, scientists need to look at their data and not just blindly apply significance tests.” I completely agree.

Edits: 12/7 correct mistake in Morey summary. 12/8 add links to reviewed commentaries.

References

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153-169.

Francis, G. (2013). We should focus on the biases that matter: A reply to commentaries. Journal of Mathematical Psychology, 57(5), 190-195.

Francis, G., Tanzman, J., & Matthews, W. J. (2014). Excess success for psychology articles in the journal Science. PLoS ONE, 9(12), e114255. doi:10.1371/journal.pone.0114255

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328-331.

Ioannidis, J. P., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245-253.

Johnson, V. E. (2013). On biases in assessing replicability, statistical consistency and publication bias. Journal of Mathematical Psychology, 57(5), 177-179.

Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180-183.

Simonsohn, U. (2013). It really just does not follow, comments on. Journal of Mathematical Psychology, 57(5), 174-176.

Vandekerckhove, J., Guan, M., & Styrcula, S. A. (2013). The consistency test may be too weak to be useful: Its systematic application would not improve effect size estimation in meta-analyses. Journal of Mathematical Psychology, 57(5), 170-173.

Can confidence intervals save psychology? Part 2

This is part 2 in a series about confidence intervals (here’s part 1). My goal is not really to answer the question in the title, but simply to discuss confidence intervals and their pros and cons. The last post explained why frequency statistics (and confidence intervals) can’t assign probabilities to one-time events, but always refer to a collective of long-run events.

If confidence intervals don’t really tell us what we want to know, does that mean we should throw them in the dumpster along with our p-values? No, for a simple reason: In the long-run we will make fewer errors with confidence intervals (CIs) than we will with p. Eventually we may want to drop CIs for more nuanced inference, but for the time being we would do much better with this simple switch.

If we calculate CIs for every (confirmatory) experiment we ever run, roughly 95% of our CIs will hit the mark (i.e., contain the true population mean). Can we ever know which ones? Tragically, no. But some would feel pretty good about the process being used if it only has a 5% life-time error rate. One could achieve a lower error rate by stretching the intervals (to say, 99%) but that would leave them too embarrassingly wide for most.
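
Here is a rough simulation sketch of that long-run claim. It is only an illustration under assumptions I picked myself: normally distributed data, a known population standard deviation (so the interval uses a z critical value rather than t), and made-up values for `mu`, `sigma`, and `n`.

```python
import math
import random
from statistics import NormalDist, fmean


def ci_coverage(mu=100.0, sigma=15.0, n=25, level=0.95,
                experiments=10_000, seed=2014):
    """Proportion of simulated CIs that contain the true mean mu."""
    rng = random.Random(seed)
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)  # known-sigma interval (assumption)
    hits = 0
    for _ in range(experiments):
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / experiments


print(ci_coverage(level=0.95))  # close to 0.95 in the long run
print(ci_coverage(level=0.99))  # fewer misses, but wider intervals
```

Notice that the simulation can only check coverage because it knows `mu`. In a real experiment we never do, which is why we can’t say which of our intervals hit the mark.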

If we use p we will be wrong 5% of the time in the long-run when we are testing a true null-hypothesis (i.e., no association between variables, or no difference between means, etc., and assuming the analysis is 100% pre-planned). But when we are testing a false null-hypothesis then we will be wrong roughly 40-50% of the time or more in the long-run (Button et al., 2013; Cohen, 1962; Sedlmeier & Gigerenzer, 1989). If you are one of the many who do not believe a null-hypothesis can actually be true, then we are always in the latter scenario with that huge error rate. In many cases (i.e., studying smallish and noisy effects, like most of psychology) we would literally be better off by flipping a coin and declaring our result “significant” whenever it lands heads.
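
To get a feel for where error rates like these come from, here is a small sketch of a power calculation. It uses a normal approximation to a two-sided, two-sample test; the effect size (d = 0.5) and the group sizes are illustrative values I chose, not numbers taken from the papers cited above.

```python
import math
from statistics import NormalDist


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    d is the true standardized mean difference (Cohen's d).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected value of the z statistic
    # Probability the statistic lands in either rejection region.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


for n in (20, 64):
    power = two_sample_power(0.5, n)
    print(f"n per group = {n}: power = {power:.2f}, "
          f"long-run miss rate = {1 - power:.2f}")
```

With a medium effect and 20 participants per group, the miss rate under this approximation is worse than a coin flip. It takes roughly 64 per group to reach the conventional 80% power.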

There is a limitation to this benefit of CIs, and this limitation is self-imposed. We cannot escape the monstrous error rates associated with p if we report CIs but then interpret them as if they are significance tests (i.e., reject if the null value falls outside the interval). Switching to confidence intervals will do nothing if we use them as a proxy for p. So the question then becomes: Do people actually interpret CIs simply as a null-hypothesis significance test? Yes, unfortunately they do (Coulson et al., 2010).

References

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but don’t guarantee, better inference than statistical significance testing. Frontiers in Psychology, 1, 26.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies?. Psychological Bulletin, 105(2), 309.

http://datacolada.org/2014/10/08/28-confidence-intervals-dont-change-how-we-think-about-data/

Can confidence intervals save psychology? Part 1

Maybe, but probably not by themselves. This post was inspired by Christian Jarrett’s recent post (you should go read it if you missed it), and the resulting Twitter discussion. This will likely develop into a series of posts on confidence intervals.

Geoff Cumming is a big proponent of replacing all hypothesis testing with CI reporting. He says we should change the goal to be precise estimation of effects using confidence intervals, with a goal of facilitating future meta-analyses. But do we understand confidence intervals? (More estimation is something I can get behind, but I think there is still room for hypothesis testing.)

In the Twitter discussion, Ryne commented, “If 95% of my CIs contain Mu, then there is .95 prob this one does [emphasis mine]. How is that wrong?” It’s wrong for the same reason Bayesian advocates dislike frequency statistics: You cannot assign probabilities to single events or parameters in that framework. The .95 probability is a property of the process of creating CIs in the long-run; it is not associated with any given interval. That means you cannot make any probabilistic claims about this interval containing Mu, or about this particular hypothesis being true.

In the frequency statistics framework, all probabilities are long-run frequencies (i.e., a proportion of times an outcome occurs out of all possible related outcomes). As such, all statements about associated probabilities must be of that nature. If a fair coin has an associated probability of 50% heads, and I flip a fair coin very many times, then in the long-run I will obtain half heads and half tails. In any given next flip there is no associated probability of heads. This flip is either heads (p(H) = 1) or tails (p(H) = 0) and we don’t know which until after we flip.¹ By assigning probabilities to single events the sense of a long-run frequency is lost (i.e., one flip is not a collective of all flips). As von Mises puts it:

Our probability theory [frequency statistics] has nothing to do with questions such as: “Is there a probability of Germany being at some time in the future involved in a war with Liberia?” (von Mises, 1957, p. 9, quoted in Oakes, 1986, p. 16)

This is why Ryne’s statement was wrong, and this is why there can be no statements of the kind, “X is the probability that these results are due to chance,”² or “There is a 50% chance that the next flip will be heads,” or “This hypothesis is probably false,” when one adopts the frequency statistics framework. All probabilities are long-run frequencies in a relevant “collective.” (Have I beaten this horse to death yet?) It’s counter-intuitive and strange that we cannot speak of any single event or parameter’s probability. But sadly we can’t in this framework, and as such, “There is .95 probability that Mu is captured by this CI,” is a vacuous statement. If you want to assign probabilities to single events and parameters come join us over in Bayesianland (we have cookies).

EDIT 11/17: See Ryne’s post for why he rejects the technical definition for a pragmatic definition.

Notes:

¹But don’t tell Daryl Bem that.

²Often a confused interpretation of the p-value. The correct interpretation is subtly different: “The probability of the obtained (or more extreme) results given chance.” “Given” is the key difference, because here you are assuming chance. How can an analysis assuming chance is true (i.e., p(chance) = 1) lead to a probability statement about chance being false?

References:

Cumming, G. (2013). The new statistics: Why and how. Psychological Science, 0956797613504966.

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. New York: Wiley.

The Special One-Way ANOVA (or, Shutting up Reviewer #2)

The One-Way Analysis of Variance (ANOVA) is a handy procedure that is commonly used when a researcher has three or more groups that they want to compare. If the test comes up significant, follow-up tests are run to determine which groups show meaningful differences. These follow-up tests are often corrected for multiple comparisons (the Bonferroni method is most common in my experience), dividing the nominal alpha (usually .05) by the number of tests. So if there are 5 follow-up tests, each comparison’s p-value must be below .01 to really “count” as significant. This reduces the test’s power considerably, but better guards against false-positives. It is common to correct all follow-up tests after a significant main effect, no matter the experimental design, but this is unnecessary when there are only three levels. H/T to Mike Aitken Deakin (here: @mrfaitkendeakin) and Chris Chambers (here: @chrisdc77) for sharing.
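
For reference, the Bonferroni arithmetic is simple enough to sketch in a few lines of Python. The helper names are mine, and the uncorrected family-wise rate shown for comparison assumes the tests are independent, which follow-up tests on shared groups generally are not.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-comparison threshold after a Bonferroni correction."""
    return alpha / n_tests


def uncorrected_fwer(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests true nulls,
    assuming (for illustration only) that the tests are independent."""
    return 1 - (1 - alpha) ** n_tests


for k in (3, 5, 10):
    print(f"{k} tests: per-test alpha = {bonferroni_alpha(0.05, k):.4f}, "
          f"uncorrected FWER = {uncorrected_fwer(0.05, k):.3f}")
```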

The Logic of the Uncorrected Test

In the case of the One-Way ANOVA with three levels, it is not necessary to correct for the extra t-tests because the experimental design ensures that the family-wise error rate will necessarily stay at 5% — so long as no follow-up tests are carried out when the overall ANOVA is not significant.

A family-wise error rate (FWER) is the allowed tolerance for making at least 1 erroneous rejection of the null-hypothesis in a set of tests. If we make 2, 3, or even 4 erroneous rejections, it isn’t considered any worse than 1. Whether or not this makes sense is for another blog post. But taking this definition, we can think through the scenarios (outlined in Chris’s tweet) and see why no corrections are needed:

True relationship: µ1 = µ2 = µ3 (null-hypothesis is really true, all groups equal). If the main effect is not significant, no follow-up tests are run and the FWER remains at 5%. (If you run follow-up tests at this point you do need to correct for multiple comparisons.) If the main effect is significant, it does not matter what the follow-up tests show because we have already committed our allotted false-positive. In other words, we’ve already made the higher order mistake of saying that some differences are present before we even examine the individual group contrasts. Again, the FWER accounts for making at least 1 erroneous rejection. So no matter what our follow-up tests show, the FWER remains at 5% since we have already made our first false-positive before even conducting the follow-ups.

True relationship: µ1 ≠ µ2 = µ3, OR µ1 = µ2 ≠ µ3, OR µ1 = µ3 ≠ µ2 (null-hypothesis is really false, one group stands out). If the main effect is significant then we are correct, and no false-positive is possible at this level. We then run our follow-up tests. Because one group really is different from the other two, only one pair of means is truly equal, so that single pair is the only place for a possible false-positive result. Again, our FWER remains at 5% because we only have 1 opportunity to erroneously reject a null-hypothesis.

True relationship: µ1 ≠ µ2 ≠ µ3 (and µ1 ≠ µ3). A false-positive is impossible in this case because all three groups are truly different. All follow-up tests necessarily keep the FWER at 0%!

There is no possible scenario where your FWER goes above 5%, so no need to correct for multiple comparisons! 
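
If you would rather see this than take the logic on faith, the three scenarios can be checked with a rough simulation. To keep the sketch dependency-free I assume a known common standard deviation, so the omnibus test is a chi-square test on 2 degrees of freedom and the follow-ups are z-tests. These stand in for the usual F and t tests and are not the exact procedure discussed above. The group means, `n`, and the function name are all made up for illustration.

```python
import math
import random
from statistics import NormalDist


def protected_fwer(mus, n=20, sigma=1.0, alpha=0.05, sims=20_000, seed=1):
    """Estimate the FWER of 'omnibus test first, then uncorrected pairwise
    tests' for three groups with true means mus (known-sigma stand-in)."""
    rng = random.Random(seed)
    chi2_crit = -2.0 * math.log(alpha)  # exact cutoff for chi-square, 2 df
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se_mean = sigma / math.sqrt(n)        # SD of each group's sample mean
    se_diff = sigma * math.sqrt(2.0 / n)  # SD of a difference of two means
    all_equal = mus[0] == mus[1] == mus[2]
    errors = 0
    for _ in range(sims):
        means = [rng.gauss(mu, se_mean) for mu in mus]
        grand = sum(means) / 3
        omnibus = n * sum((m - grand) ** 2 for m in means) / sigma ** 2
        if omnibus <= chi2_crit:
            continue  # omnibus not significant: no follow-up tests are run
        if all_equal:
            errors += 1  # the omnibus rejection is itself the false positive
            continue
        for i, j in ((0, 1), (0, 2), (1, 2)):
            # Only pairs with truly equal means can yield a false positive.
            if mus[i] == mus[j] and abs(means[i] - means[j]) / se_diff > z_crit:
                errors += 1
                break
    return errors / sims


print(protected_fwer((0.0, 0.0, 0.0)))  # all groups equal
print(protected_fwer((0.8, 0.0, 0.0)))  # one group stands out
print(protected_fwer((0.0, 0.5, 1.0)))  # all groups differ
```

In each scenario the estimated rate should sit at or below roughly 5%, and in the last one it is exactly zero because no true null is ever tested.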

So the next time Reviewer #2 gives you a hard time about correcting for multiple comparisons on a One-Way ANOVA with three levels, you can rightfully defend your uncorrected t-tests. Not correcting the alpha saves you some power, thereby making it easier to support your interesting findings.

If you wanted to sidestep the multiple comparison problem altogether you could do a fully Bayesian analysis, in which the number of tests conducted has no bearing on the evidence from a single test. In other words, you could jump straight to the comparisons of interest instead of doing the significant main effect → follow-up test routine. Wouldn’t that save us all a lot of hassle?

 

The Broken Ratchet

In a recent paper, Tennie and colleagues provide new data with regard to the concept of cumulative cultural learning. They set out to find evidence for a cultural “ratchet”, a mechanism by which one secures advantageous behavior seen in others, while simultaneously improving the behavior to become more efficient/productive. This is most commonly done through diffusion chains, as is done here. The authors rounded up 80 four-year-olds (40 male, 40 female) and sorted them into chains of 5 kids each, leaving them with eight male and eight female chains. What follows is what I took away from this paper.

The kids’ task was simple: Try to fill a bucket with as much dry rice as possible. Two kids would be in the room at a time. Kids who completed their turn would swap out for kids who were new to the task, so that there was always 1 kid filling the bucket and 1 kid watching. The kids were given different tools they could potentially use (see their figure 1). Some tools were obviously better than others, judging by their carrying capacities: Bowl – 817.5g, Bucket – 439.7g, Scoop – 63.9g, Cardboard – 21.5g. In half of the chains, the first child saw an experimenter use the worst tool of the bunch (flimsy cardboard, circled in the figure) and the other half didn’t get a demonstration at all.

As the authors said, “A main question of interest was whether children copied [Experimenter]’s and/or the previous child’s choice of tool or whether they innovated by introducing new tools”. In other words, evidence for a ratchet effect would manifest in later generations using more productive tools than the earlier generations. Another interest is whether this innovation differed between conditions, that is, between chains that had an experimenter demonstrate and chains that did not. Not sure why this manipulation is interesting, seeing as the only kids who see the experimenter perform the task are in Generation 1.

Without even going into the stats, I don’t see much evidence that kids are ratcheting. Most chains in the baseline show the following pattern: Generation 1 uses tool X and all subsequent generations use tool X. Two chains manage to break the imitation spell, both switching from scoop+bucket to scoop+bowl. The experimental group shows a similar pattern, where the kids either all copy generation 1 (who copied the experimenter) or one adventurous kid in the chain decides to switch tools and the rest copy him/her. Interestingly, the chains in the experimental group only ever switched from the cardboard to the scoop, effectively going from the worst tool to the second-worst tool. If these kids were trying to score the most rice, wouldn’t it be best to switch to the bucket or the bowl? Weird.

The authors propose that kids in baseline didn’t innovate across generations because they were already performing at a high level in generation 1, so they didn’t have room to grow. Well, the only chains that did actually innovate in baseline started with scoop + bucket (second-highest capacity tool) and went to scoop + bowl (highest capacity tool). Further, the chains in the lowest starting position, scoop only, never innovated.

Overall I thought the experiment was cool. Rounding up 80 four-year-olds is not to be scoffed at. But I don’t agree with their claim that the baseline group was at ceiling and I don’t see much ratcheting in the experimental group (who all start with the worst tool).