This is equivalent to saying that if the application of a principle to given evidence leads to an absurdity then the evidence must be discarded. It is reminiscent of the heavy smoker, who, worried by the literature relating smoking to lung cancer, decided to give up reading.
— Cornfield, 1966 (pdf link)
The next steps series intro:
After the great response to the eight easy steps paper we posted, I have decided to start a recurring series, where each week I highlight one of the papers that we included in the appendix of the paper. The format will be short and simple: I will give a quick summary of the paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere. At the end of the post I will list a few suggestions for the next entry, so vote in the comments or on twitter (@alxetz) for which one you’d like next.
Sequential trials, sequential analysis and the likelihood principle
Theoretical focus, low difficulty
Cornfield (1966) begins by posing a question:
Do the conclusions to be drawn from any set of data depend only on the data or do they depend also on the stopping rule which led to the data? (p. 18)
The purpose of his paper is to discuss this question and explore the implications of answering “yes” versus “no.” This paper is a natural follow-up to entries one and three in the eight easy steps paper.
If you have read the eight easy steps paper (or at least the first and third steps), you’ll know that the answer to the above question for classical statistics is “yes”, while the answer for Bayesian statistics is “no.”
Cornfield introduces a concept he calls the “α-postulate,” which states:
All hypotheses rejected at the same critical level [i.e., p<.05] have equal amounts of evidence against them. (p. 19)
Through a series of examples, Cornfield shows that the α-postulate appears to be false.
Cornfield then introduces a concept called the likelihood principle, which comes up in a few of the eight easy steps entries. The likelihood principle says that the likelihood function contains all of the information relevant to the evaluation of statistical evidence. Other facets of the data that do not factor into the likelihood function are irrelevant to the evaluation of the strength of the statistical evidence.
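A standard textbook illustration of the principle (not one of Cornfield’s own examples) compares two stopping rules that happen to produce the same data: flip a coin a fixed 12 times and observe 9 heads, versus flip until the 3rd tail appears and happen to need 12 flips. The two sampling plans give different p-values, but their likelihood functions for the coin’s bias are proportional, so by the likelihood principle they carry the same evidence. A minimal Python sketch:

```python
import numpy as np
from scipy.stats import binom, nbinom

theta = np.linspace(0.01, 0.99, 99)  # candidate values for P(heads)

# Same observed data under two different stopping rules:
# 1) Binomial: n = 12 flips fixed in advance, 9 heads observed.
lik_binom = binom.pmf(9, n=12, p=theta)

# 2) Negative binomial: flip until the 3rd tail, which took 12 flips
#    (9 heads before the 3rd tail). scipy counts the "failures" (heads)
#    before the r-th "success" (tail), with success probability 1 - theta.
lik_nbinom = nbinom.pmf(9, 3, 1 - theta)

# The two likelihood functions are proportional: their ratio is constant in theta.
ratio = lik_binom / lik_nbinom
print(ratio.min(), ratio.max())  # identical up to floating-point error

# Normalized to their maxima they are the same curve, so by the likelihood
# principle the evidence about theta is the same under either stopping rule.
print(np.allclose(lik_binom / lik_binom.max(), lik_nbinom / lik_nbinom.max()))
```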
He goes on to show that if, instead of following the Neyman-Pearson procedure of minimizing the type-II error rate (β; i.e., maximizing power) for a fixed type-I error rate (α, usually 5%), we minimize a linear combination of the two error rates, the resulting test conforms to the likelihood principle: it reaches the same conclusion for every sample with the same likelihood function, whatever the design.
Thus, if instead of minimizing β for a given α, we minimize [their linear combination], we must come to the same conclusion for all sample points which have the same likelihood function, no matter what the design. (p. 21)
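The argument behind that quote can be reconstructed in a few lines (my notation and sketch, not Cornfield’s own derivation). Let R be the rejection region, f_0 and f_1 the sampling densities under the null and the alternative, and c_0, c_1 > 0 the weights attached to the two error rates:

```latex
% Expected weighted error of a test with rejection region R:
c_0\,\alpha + c_1\,\beta
  = c_0 \int_{R} f_0(x)\,dx + c_1 \int_{R^{c}} f_1(x)\,dx
  = c_1 + \int_{R} \bigl[\, c_0\, f_0(x) - c_1\, f_1(x) \,\bigr]\,dx .

% The integral is minimized by putting x in R exactly when the bracket is
% negative, so the optimal rule is
\text{reject } H_0 \iff \frac{f_1(x)}{f_0(x)} > \frac{c_0}{c_1},
% which depends on the data only through the likelihood ratio,
% not on the sample space or the stopping rule.
```

Fixing α in advance, by contrast, forces the critical value to depend on the sampling plan, which is why the p-value changes with the stopping rule while the likelihood ratio does not.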
A few choice quotes
page 19 (emphasis added):
The following example will be recognized by statisticians with consulting experience as a simplified version of a very common situation. An experimenter, having made n observations in the expectation that they would permit the rejection of a particular hypothesis, at some predesignated significance level, say .05, finds that he has not quite attained this critical level. He still believes that the hypothesis is false and asks how many more observations would be required to have reasonable certainty of rejecting the hypothesis if the means observed after n observations are taken as the true values. He also makes it clear that had the original n observations permitted rejection he would simply have published his findings. Under these circumstances it is evident that there is no amount of additional observation, no matter how large, which would permit rejection at the .05 level. If the hypothesis being tested is true, there is a .05 chance of its having been rejected after the first round of observations. To this chance must be added the probability of rejecting after the second round, given failure to reject after the first, and this increases the total chance of erroneous rejection to above .05. In fact … no amount of additional evidence can be collected which would provide evidence against the hypothesis equivalent to rejection at the P =.05 level
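A rough simulation of the situation Cornfield describes (my own sketch with made-up numbers: a one-sample z-test of H0: μ = 0 with σ = 1 known, a first batch of 20 observations, and a second batch of 20 added whenever the first test misses significance; Cornfield’s experimenter would instead choose the second sample size from the observed means) shows the overall false-rejection rate climbing above the nominal .05:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def two_stage_error_rate(n1=20, n2=20, alpha=0.05, sims=200_000):
    """False-rejection rate of a true H0: mu = 0 (sigma = 1) when a second
    batch of observations is added after a non-significant first test and
    the combined data are tested again at the same level."""
    crit = norm.ppf(1 - alpha / 2)
    x1 = rng.normal(0, 1, size=(sims, n1))
    x2 = rng.normal(0, 1, size=(sims, n2))
    z1 = x1.mean(axis=1) * np.sqrt(n1)          # first-stage z statistic
    reject1 = np.abs(z1) > crit                 # would stop and publish now
    total_mean = (x1.sum(axis=1) + x2.sum(axis=1)) / (n1 + n2)
    z2 = total_mean * np.sqrt(n1 + n2)          # z statistic on all the data
    reject2 = ~reject1 & (np.abs(z2) > crit)    # second look at the same level
    return (reject1 | reject2).mean()

print(two_stage_error_rate())   # well above the nominal 0.05
```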
page 19-20 (emphasis added):
I realize, of course, that practical people tend to become impatient with counter-examples of this type. Quite properly they regard principles as only approximate guides to practice, and not as prescriptions that must be literally followed even when they lead to absurdities. But if one is unwilling to be guided by the α-postulate in the examples given, why should he be any more willing to accept it when analyzing sequential trials? The biostatistician’s responsibility for providing biomedical scientists with a satisfactory explication of inference cannot, in my opinion, be satisfied by applying certain principles when he agrees with their consequences and by disregarding them when he doesn’t.
page 22 (emphasis added):
The stopping rule is this: continue observations until a normal mean differs from the hypothesized value by k standard errors, at which point stop. It is certain, using the rule, that one will eventually differ from the hypothesized value by at least k standard errors even when the hypothesis is true. … The Bayesian viewpoint of the example is as follows. If one is seriously concerned about the probability that a stopping rule will certainly result in the rejection of a true hypothesis, it must be because some possibility of the truth of the hypothesis is being entertained. In that case it is appropriate to assign a non-zero prior probability to the hypothesis. If this is done, differing from the hypothesized value by k standard errors will not result in the same posterior probability for the hypothesis for all values of n. In fact for fixed k the posterior probability of the hypothesis monotonically approaches unity as n increases, no matter how small the prior probability assigned, so long as it is non-zero, and how large the k, so long as it is finite. Differing by k standard errors does not therefore necessarily provide any evidence against the hypothesis and disregarding the stopping rule does not lead to an absurd conclusion. The Bayesian viewpoint thus indicates that the hypothesis is certain to be erroneously rejected-not because the stopping rule was disregarded-but because the hypothesis was assigned zero prior probability and that such assignment is inconsistent with concern over the possibility that the hypothesis will certainly be rejected when true.
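Cornfield’s claim in that last passage can be checked numerically for the usual normal model (my sketch, not his derivation): H0: μ = 0 with prior probability π0 > 0, a N(0, τ²) prior for μ under the alternative, σ = 1 known, and a sample mean that sits exactly k standard errors from zero when sampling stops. For fixed k, the posterior probability of H0 then climbs toward 1 as n grows:

```python
import numpy as np

def posterior_h0(n, k=1.96, sigma=1.0, tau=1.0, prior_h0=0.5):
    """Posterior P(H0 | data) when the observed mean is exactly k standard
    errors from 0. H0: mu = 0; H1: mu ~ N(0, tau^2); sigma known."""
    se2 = sigma**2 / n                       # variance of the sample mean
    xbar = k * np.sqrt(se2)                  # stopped k standard errors out
    # Marginal density of xbar: N(0, se2) under H0, N(0, tau^2 + se2) under H1.
    log_bf01 = (0.5 * np.log((tau**2 + se2) / se2)
                - 0.5 * xbar**2 * (1 / se2 - 1 / (tau**2 + se2)))
    bf01 = np.exp(log_bf01)                  # evidence for H0 over H1
    prior_odds = prior_h0 / (1 - prior_h0)
    post_odds = prior_odds * bf01
    return post_odds / (1 + post_odds)

for n in [10, 100, 1_000, 10_000, 1_000_000]:
    print(n, round(posterior_h0(n), 3))      # climbs toward 1 as n grows
```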
Vote for the next entry:
- Edwards, Lindman, and Savage (1963) — Bayesian Statistical Inference for Psychological Research (pdf)
- Rouder (2014) — Optional Stopping: No Problem for Bayesians (pdf)
- Gallistel (2009) — The Importance of Proving the Null (pdf)
- Berger and Delampady (1987) — Testing Precise Hypotheses (pdf)
Hi Alex, I’ll cast my vote in a few days! I have a related request; I hope it is appropriate to post it here. I am trying to convince my co-workers that we should teach our undergraduate marketing students Bayesian statistics, in the order that MIT teaches it: first Bayesian, then frequentist (because it’s harder).
The papers give me lots of arguments. Yet, what I would need is a *comparison of both methods in real life applications*. Such as:
– inference of a population mean from a sample (in the Bayesian case with both an informative and a non-informative prior)
– comparison of two, three, or four groups (say, in a marketing application: do young, middle-aged, and older customers behave differently in webshops?)
I’d like to show the practical advantages, such as more precise estimates, lower error rates (whether type I and II or type S and M), easier interpretation, and so on. I did find some comparisons in E. T. Jaynes, but those could be dismissed as “fringe cases” (although Jaynes stated that he sought out those cases on purpose, to serve as a “telescope” that highlights the differences between the methods).
Anyway, would you happen to know a source who does such a real life applications comparison?
Pieter,
Perhaps try looking through Zoltan Dienes’s papers:
Dienes, How Bayes factors change our science (pdf)
Dienes, Bayes and the unconscious (pdf)
http://journal.frontiersin.org/article/10.3389/fpsyg.2014.00781/full
One other thing – am I reasoning correctly here?
Assumptions:
– the physicists who work on problems such as black holes or string theory are smart people.
– money is not really a problem: the LIGO experiment, the CERN particle accelerator, etc. are expensive, yet they get funded.
– computer capacity is not a problem: they have the fastest supercomputers available.
Therefore, I think it is reasonable to assume that they pick the best sort of statistics for the job. Agreed? Now let’s see what they use.
1. The Higgs particle. Detected in the CERN particle accelerator. http://arxiv.org/pdf/1412.8662v2.pdf
Frequentist statistics were used. This seems logical. The Higgs discovery (as far as I, as a non-physicist, understand it) was about doing many tests (many collisions inside the particle accelerator) and comparing the results to some hypothesis. Millions of observations in a controlled environment.
2. Gravitational waves. https://dcc.ligo.org/public/0122/P1500217/014/LIGO-P1500217_GW150914_Rates.pdf Bayesian statistics were used. This seems logical as well: you don’t observe millions of gravitational waves; by sheer luck they happened to observe *one* wave. Question: what is this particular observation telling us?
Most of my (students’) problems are in the latter category: what can we learn from one particular observation? Therefore, most of the time, Bayesian statistics is answering the right question.
Correct?
Pieter,
There was an interesting discussion (started by Bayesians) asking why frequentist statistics were used instead of Bayesian statistics for the Higgs discovery. Perhaps you are interested: http://tonyohagan.co.uk/academic/pdf/HiggsBoson.pdf I have not followed the physics analyses very closely, so I don’t have much to say about them myself.
In general, I would agree that if you have very limited resources then using Bayesian statistics is ideal. Since you can add relevant background knowledge into the analysis, you can make stronger commitments to theory, and consequently you can make informative inferences even with few observations.
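As a toy illustration of that last point (my own hypothetical example, not taken from any of the papers above), here is a conjugate normal sketch for something like your webshop setting: with only five observations, an informative prior from past data noticeably tightens the interval estimate relative to the frequentist confidence interval built from those five observations alone.

```python
import numpy as np
from scipy import stats

# Hypothetical data: average satisfaction score (1-10 scale) in a webshop,
# measured on only n = 5 customers.
y = np.array([7.2, 6.8, 8.1, 7.5, 6.9])
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

# Frequentist: 95% t-interval based on the 5 observations alone.
t_ci = stats.t.interval(0.95, n - 1, loc=ybar, scale=s / np.sqrt(n))

# Bayesian (known-variance normal-normal conjugate sketch, sigma assumed 0.6):
# informative prior from past campaigns, mu ~ N(7.0, 0.3^2).
sigma = 0.6
prior_mu, prior_sd = 7.0, 0.3
post_var = 1 / (1 / prior_sd**2 + n / sigma**2)
post_mu = post_var * (prior_mu / prior_sd**2 + n * ybar / sigma**2)
cred = stats.norm.interval(0.95, loc=post_mu, scale=np.sqrt(post_var))

print("95% confidence interval:", np.round(t_ci, 2))
print("95% credible interval:  ", np.round(cred, 2))
# With so few observations, the informative prior noticeably narrows the interval.
```

Whether the tighter interval is a virtue depends, of course, on whether the prior is actually defensible, which is worth pointing out to students alongside the example.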
And by the way, my vote for the next article to be discussed is
4. Berger and Delampady (1987) — Testing Precise Hypotheses