*TLDR: I typed up some of my technical notes where I derive the Kullback-Leibler divergence for some common distributions. Find them here on PsyArXiv.*

The Kullback-Leibler (KL) divergence is a concept that arises pretty frequently across many different areas of statistics. I recently found myself needing to use the KL divergence for a particular Bayesian application, so I hit up google to find resources on it. The wikipedia page is not exactly … hmm how should I say this … *friendly?* Fortunately, there are a few nice tutorials online explaining the general concept, such as this, or this, or this (they are all nice but the statistician in me seems to prefer the third link).

Essentially, if we have two competing distributions/models that could have generated the data, the KL divergence gives us the expected log likelihood ratio in favor of the true distribution. (A refresher on likelihoods and likelihood ratios is here.) The log likelihood ratio can be interpreted as the amount of evidence the data provide for one model versus another, so the KL divergence tells us how much evidence we can expect our data to provide in favor of the true model.

It turns out that the KL divergence is pretty damn useful for tons of practical stuff, so it’s a good thing to know. For instance, one can use the KL divergence to design the optimal experiment, in terms of having the most efficient accumulation of evidence. If design A has higher KL divergence than design B, then we expect to gain more evidence per observational unit using design A than design B.

I thought about writing a little primer on KL divergence, but I don’t know if the world needs another conceptual tutorial on this; as I already mentioned, there are some good ones out there already. However, there aren’t many resources online that walk you through how you might actually derive the KL divergence in practice (i.e., for non-toy distributions). Seriously, where are the worked examples? I really doubt anyone can get a feel for a concept like this without seeing a few cases worked out in detail.

At some point after searching for a while I got fed up, and I did what any other soon-to-be-card-carrying-statistician would do: I sat down and worked some examples out for myself. (Not gonna lie, I was pretty proud of myself for having the confidence to jump right into it. A few years ago I may have just given up after my failed google search. Anyway…)

For instance, what is the KL divergence between two normal distributions with different means but *the same* variance? If you google hard enough (or work it out yourself) you would find that in this case the KL divergence is the squared difference in means divided by twice the (common) variance. That is, the KL divergence is half of the squared standardized mean difference. Thus, the expected log likelihood ratio between a N(0,1) distribution and a N(2,1) distribution is (2-0)²/(2*1) = 2.

How precisely do we go from the definition of KL divergence to this result? This question started my KL quest, and that’s the point where my technical notes come in. The distributions covered in these notes include Bernoulli, Geometric, Poisson, Exponential, and Normal.

My technical notes are available via PsyArXiv. Any comments, feedback, or requests for other derivations are welcome.

I came across your original tweet from the Levenshtein angle regarding corrections in lines in texts, but never mind.

KL is very useful in establishing distances in word frequencies between doc1 and doc2, including ‘correction on the press’ in 17thC book 1st edition 1 and book 1st edition 2. Also useful for construction of medieval manuscript stemmas.

These applications do not benefit from KL applied to materials in such as your calcs for different distributions, or do they?