The following discussion is essentially nontechnical; the aim is only to convey a little introductory “feel” for our outlook, purpose, and terminology, and to alert newcomers to common pitfalls of understanding.
Sometimes, in our perplexity, it has seemed to us that there are two basically different kinds of mentality in statistics; those who see the point of Bayesian inference at once, and need no explanation; and those who never see it, however much explanation is given.
–Jaynes, 1986 (pdf link)
The format of this series is short and simple: Every week I will give a quick summary of a paper while sharing a few excerpts that I like. If you’ve read our eight easy steps paper and you’d like to follow along on this extension, I think a pace of one paper per week is a perfect way to ease yourself into the Bayesian sphere.
Bayesian Methods: General Background
The necessity of reasoning as best we can in situations where our information is incomplete is faced by all of us, every waking hour of our lives. (p. 2)
In order to understand Bayesian methods, I think it is essential to have some basic knowledge of their history. This paper by Jaynes (pdf) is an excellent place to start.
[Herodotus] notes that a decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it as the best one to make; and that a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences. (p. 2)
Jaynes traces the history of Bayesian reasoning all the way back to Herodotus, around 500 BC. Herodotus could hardly be called a Bayesian, but the above quote captures the essence of Bayesian decision theory: take the action that maximizes your expected gain. It may turn out to be the wrong choice in the end, but if the reasoning that led to your choice was sound, then you took the correct course.
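To make "maximize your expected gain" concrete, here is a minimal sketch. The umbrella scenario, the probability of rain, and the payoff numbers are all invented by me for illustration; nothing like this appears in Jaynes's paper.

```python
# Hypothetical numbers, chosen only to illustrate the idea.
p_rain = 0.3  # our state of knowledge: probability that it rains

# gains[action][outcome]: the (made-up) gain for each action under each outcome
gains = {
    "take umbrella": {"rain": 0, "dry": -1},    # small nuisance if it stays dry
    "leave umbrella": {"rain": -10, "dry": 0},  # soaked if it rains
}

def expected_gain(action_gains, p_rain):
    """Probability-weighted average of the gains over the two outcomes."""
    return p_rain * action_gains["rain"] + (1 - p_rain) * action_gains["dry"]

expected = {action: expected_gain(g, p_rain) for action, g in gains.items()}

# The action with the largest expected gain, given what we knew beforehand.
best_action = max(expected, key=expected.get)

for action, value in expected.items():
    print(f"{action}: expected gain {value:.2f}")
print("best action:", best_action)
```

With these numbers, taking the umbrella has the higher expected gain. If the day turns out dry, the choice was still the sound one in Herodotus's sense, because it was the best one given the information at hand.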
After all, our goal is not omniscience, but only to reason as best we can with whatever incomplete information we have. To demand more than this is to demand the impossible; neither Bernoulli’s procedure nor any other that might be put in its place can get something for nothing. (p. 3)
Much of the foundation for Bayesian inference was actually laid down by James Bernoulli, in his work Ars Conjectandi (“the art of conjecture”) in 1713. Bernoulli was the first to devise a rational way of specifying a state of incomplete information. He put forth the idea that one can enumerate all N “equally possible” cases, and then count the number M of those cases in which some event A can occur. Then the probability of A, call it p(A), is just M/N: the ratio of the number of cases in which A can occur (M) to the total number of cases (N).
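Bernoulli's counting rule is easy to sketch in a few lines. The example below is my own, not one from the paper: I assume two fair six-sided dice, so that all 36 ordered pairs count as "equally possible" cases, and take A to be the event that the two faces sum to 7.

```python
from fractions import Fraction
from itertools import product

# Enumerate every "equally possible" case: all ordered pairs of two dice.
cases = list(product(range(1, 7), repeat=2))
N = len(cases)

# Count the cases in which the event A (the faces sum to 7) occurs.
M = sum(1 for first, second in cases if first + second == 7)

# Bernoulli's rule: p(A) = M / N, kept as an exact fraction.
p_A = Fraction(M, N)

print(f"N = {N}, M = {M}, p(A) = {p_A}")
```

The only real work is deciding what counts as an "equally possible" case; once that is settled, the probability is pure counting.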
Jaynes gives only a passing mention to Bayes, noting his work “had little if any direct influence on the later development of probability theory” (p. 5). Laplace, Jeffreys, Cox, and Shannon all get a thorough discussion, and there is a lot of interesting material in those sections.
Despite the name, Bayes’ theorem was really formulated by Laplace. By all accounts, we should all be Laplacians right now.
The basic theorem appears today as almost trivially simple; yet it is by far the most important principle underlying scientific inference. (p. 5)
Laplace used Bayes’ theorem to estimate the mass of Saturn, and, by the best estimates available when Jaynes was writing, his estimate was correct to within 0.63%. That is very impressive for work done in the 18th century!
This strange history is only one of the reasons why, today [speaking in 1984], we Bayesians need to take the greatest pains to explain our rationale, as I am trying to do here. It is not that it is technically complicated; it is the way we have all been thinking intuitively from childhood. It is just so different from what we were all taught in formal courses on “orthodox” probability theory, which paralyze the mind into an inability to see a distinction between probability and frequency. Students who come to us free of that impediment have no difficulty in understanding our rationale, and are incredulous that anyone could fail to understand it. (p. 7)
The sections on Laplace, Jeffreys, Cox and Shannon are all very good, but I will skip most of them because I think the most interesting and illuminating section of this paper is “Communication Difficulties” beginning on page 10.
Our background remarks would be incomplete without taking note of a serious disease that has afflicted probability theory for 200 years. There is a long history of confusion and controversy, leading in some cases to a paralytic inability to communicate. (p. 10)
Jaynes is concerned in this section with the communication difficulties that Bayesians and frequentists have historically encountered.
[Since the 1930s] there has been a puzzling communication block that has prevented orthodoxians [frequentists] from comprehending Bayesian methods, and Bayesians from comprehending orthodox criticisms of our methods. (p. 10)
On the topic of this disagreement, Jaynes gives a nice quote from L.J. Savage: “there has seldom been such complete disagreement and breakdown of communication since the tower of Babel.” I wrote about one kind of communication breakdown in last week’s Sunday Bayes entry.
So what is the disagreement that Jaynes believes underlies much of the conflict between Bayesians and frequentists?
For decades Bayesians have been accused of “supposing that an unknown parameter is a random variable”; and we have denied hundreds of times with increasing vehemence, that we are making any such assumption. (p. 11)
Jaynes believes the confusion can be made clear by rephrasing the criticism as George Barnard once did.
Barnard complained that Bayesian methods of parameter estimation, which present our conclusions in the form of a posterior distribution, are illogical; for “How could the distribution of a parameter possibly become known from data which were taken with only one value of the parameter actually present?” (p. 11)
Aha, this is a key reformulation! It really illuminates the confusion between frequentists and Bayesians. To show why, I’ll give one long quote to finish this Sunday Bayes entry.
Orthodoxians trying to understand Bayesian methods have been caught in a semantic trap by their habitual use of the phrase “distribution of the parameter” when one should have said “distribution of the probability”. Bayesians had supposed this to be merely a figure of speech; i.e., that those who used it did so only out of force of habit, and really knew better. But now it seems that our critics have been taking that phraseology quite literally all the time.
Therefore, let us belabor still another time what we had previously thought too obvious to mention. In Bayesian parameter estimation, both the prior and posterior distributions represent, not any measurable property of the parameter, but only our own state of knowledge about it. The width of the distribution is not intended to indicate the range of variability of the true values of the parameter, as Barnard’s terminology had led him to suppose. It indicates the range of values that are consistent with our prior information and data, and which honesty therefore compels us to admit as possible values. What is “distributed” is not the parameter, but the probability. [emphasis added]
Now it appears that, for all these years, those who have seemed immune to all Bayesian explanation have just misunderstood our purpose. All this time, we had thought it clear from our subject-matter context that we are trying to estimate the value that the parameter had at the time the data were taken. [emphasis original] Put more generally, we are trying to draw inferences about what actually did happen in the experiment; not about the things that might have happened but did not. (p. 11)
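A small simulation can make Jaynes's point tangible. This is only a sketch under my own assumptions, none of which come from the paper: a coin whose bias is fixed at a single value (the hypothetical `theta_true = 0.7`), a uniform prior on that bias, and simulated flips. With a uniform prior and k heads in n flips, the posterior is a Beta(k + 1, n - k + 1) distribution, so its mean and standard deviation are available in closed form.

```python
import math
import random

# Assumption for illustration: the parameter has ONE fixed value throughout.
theta_true = 0.7

rng = random.Random(0)  # seeded so the simulated data are reproducible
flips = [rng.random() < theta_true for _ in range(1000)]

def posterior_mean_sd(k, n):
    """Mean and standard deviation of the Beta(k + 1, n - k + 1) posterior
    that results from a uniform prior and k heads in n flips."""
    a, b = k + 1, n - k + 1
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

widths = {}
for n in (10, 100, 1000):
    k = sum(flips[:n])  # heads among the first n flips
    mean, sd = posterior_mean_sd(k, n)
    widths[n] = sd
    print(f"n = {n:4d}: posterior mean {mean:.3f}, posterior sd {sd:.3f}")
```

The parameter never varies in this simulation, yet the posterior has a width. That width shrinks as data accumulate because it describes our state of knowledge about the one value that was present, not any variability in the parameter itself.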
I think if you really read the section on communication difficulties closely, then you will see that a lot of the conflict between Bayesians and frequentists can be boiled down to deep semantic confusion. We are often just talking past one another, getting ever more frustrated that the other side doesn’t understand our very simple points. Once this is sorted out I think a lot of the problems frequentists see with Bayesian methods will go away.