Probably Random

A blog by Paul Siegel

The principle of maximum entropy

January 27, 2022

Problem: You encounter an apparently normal six-sided die, but you are told that the average roll is $4.5$. The die must be weighted, or else its average roll would be $3.5$. How do you estimate the probability that the face $2$ is rolled?

From the point of view of Bayesian modeling, we are looking for the “most uninformed” prior for the die, i.e. the probability distribution which is as close as possible to the uniform distribution while satisfying the constraint that its mean must be $4.5$. Let $u$ denote the uniform distribution, and let $\Theta$ denote the space of all probability distributions $p$ with mean $4.5$; we seek:

\[\mathop{\mathrm{arg\,max}}_{p \in \Theta}\: \P(p \vert u)\]

It is not immediately obvious what the expression $\P(p \vert u)$ should mean: after all, $p$ is a model, not an event or the outcome of an experiment. We will interpret the expression as the probability that a sequence of die rolls drawn from the distribution $p$ could have been generated by $u$. We expect this probability to decay to zero as the number of die rolls increases, but we can still choose the distribution $p$ which yields the highest probability asymptotically.

Multinomial likelihood

Let’s start by analyzing the statistics of die rolls in general. Some notation: let $N$ denote the number of rolls, $k$ the number of faces of the die, $q_i$ the probability of rolling the $i$th face, and $r_i$ the number of times the $i$th face appears among the $N$ rolls.

The probability that the counts $r_i$ occur in $N$ rolls of a $k$-sided die with face weights $q_i$ is given by the multinomial distribution:

\[\frac{N!}{r_1! r_2! \ldots r_k!} q_1^{r_1} q_2^{r_2} \ldots q_k^{r_k}\]

This is called the likelihood model for die rolls. The same likelihood model applies any time we make repeated, independent observations of a random process that selects from a finite set of states. Classification problems in machine learning often have this form, for example.
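
To make this concrete, here is a minimal Python sketch, not from the original post, which evaluates the multinomial likelihood directly from the formula above; the function name `multinomial_likelihood` is my own.

```python
from math import factorial, prod

def multinomial_likelihood(counts, weights):
    """Probability of observing the face counts r_1, ..., r_k in
    N = sum(counts) independent rolls of a die with the given face weights."""
    N = sum(counts)
    coefficient = factorial(N)
    for r in counts:
        coefficient //= factorial(r)  # exact integer division at every step
    return coefficient * prod(q ** r for q, r in zip(weights, counts))

# Example: a fair six-sided die rolled 12 times, each face appearing twice.
print(multinomial_likelihood([2] * 6, [1 / 6] * 6))  # about 0.0034
```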

Empirical distance between models

Let us now try to compute $\P(p \vert q)$ where $p$ and $q$ are two probability distributions describing the same die. As above, we interpret this to mean the probability that a typical sequence of die rolls generated by $p$ could have been generated by $q$ instead. We can think of this as a sort of empirical distance between $p$ and $q$, though note that it is not symmetric for most pairs of distributions.

Asymptotics of the multinomial distribution

Imagine we are given the opportunity to roll the die $N$ times, where $N$ is very large, and we empirically observe that the $i$th face is rolled $r_i$ times. If the distribution $p$ is the correct model for the die, then we should have $p_i \approx \frac{r_i}{N}$ for $N$ large. As in the previous section, the probability of observing the empirical probabilities $p = (p_1, \ldots, p_k)$ given the prior $q = (q_1, \ldots, q_k)$ is given by:

\[\P(p \vert q) = \frac{N!}{(p_1 N)! \ldots (p_k N)!} q_1^{p_1 N} \ldots q_k^{p_k N}\]

This expression is a little easier to work with if we take logarithms:

\[\log \P(p \vert q) = \log N! - \sum_i \log (p_i N)! + N \sum_i p_i \log q_i\]

At this point we invoke a remarkable and ubiquitous asymptotic formula due to Stirling:

We have:

\[n! = \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n \left( 1 + O \left(\frac{1}{n} \right) \right)\]

as $n \to \infty$.

Taking logarithms, this becomes:

\[\log n! = n \log n - n + \frac{1}{2} \log 2 \pi n + \eps_n\]

where $\eps_n \to 0$ as $n \to \infty$.
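
As a quick numerical sanity check on these two statements, here is a short Python snippet, my own rather than the post’s, comparing $\log n!$ with the full Stirling approximation and with the cruder $n \log n - n$ form used below.

```python
from math import factorial, log, pi

# Compare log(n!) with the full Stirling approximation and with the cruder
# n*log(n) - n form; the error of the cruder form grows only like log(n).
for n in (10, 100, 1000):
    exact = log(factorial(n))
    stirling = n * log(n) - n + 0.5 * log(2 * pi * n)
    crude = n * log(n) - n
    print(n, exact - stirling, exact - crude)
```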

We will write $\log n! \approx n \log n - n$, with the understanding that the error of this approximation has order $\log n$. Substituting into our expression for $\P(p \vert q)$, we get:

\[\log \P(p \vert q) \approx N \log N - N - \sum_i \left( p_i N \log p_i N - p_i N \right) + N \sum_i p_i \log q_i\]

The middle term simplifies a bit, since $\sum_i p_i = 1$:

\[\begin{align*} \sum_i \left( p_i N \log p_i N - p_i N \right) &= \sum_i \left( p_i N \log p_i + p_i N \log N - p_i N \right) \\ &= N \sum_i p_i \log p_i + (N \log N - N) \sum_i p_i \\ &= N \sum_i p_i \log p_i + N \log N - N \end{align*}\]

Substitute this in above:

\[\log \P(p \vert q) \approx -N \sum_i \left( p_i \log p_i - p_i \log q_i \right) = -N \sum_i p_i \log \frac{p_i}{q_i}\]

Relative entropy

We have stumbled upon the quantity $\sum_i p_i \log \frac{p_i}{q_i}$, which the reader may recognize from information theory. If not, the terminology is as follows:

Let $p$ and $q$ be probability distributions defined on the same finite probability space, and assume that $p$ is absolutely continuous with respect to $q$, meaning an event has $p$-probability $0$ whenever it has $q$-probability $0$. Define the KL-divergence or alternatively relative entropy of $p$ given $q$ to be:

\[\KL{p}{q} = \sum_i p_i \log \frac{p_i}{q_i}\]

(with the convention $0 \log 0 = 0$).
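
It will be convenient to have this definition as code. Here is a minimal Python sketch, with names of my own choosing, implementing the definition together with the $0 \log 0 = 0$ convention; the two printed values also illustrate the asymmetry mentioned earlier.

```python
from math import log

def kl_divergence(p, q):
    """Relative entropy KL(p || q) on a finite sample space, using the
    convention 0 * log 0 = 0.  Assumes p is absolutely continuous with
    respect to q, i.e. p_i > 0 implies q_i > 0."""
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl_divergence(p, q), kl_divergence(q, p))  # not equal: KL is asymmetric
```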

Combining this notation with the approximation above, we conclude:

\[\P(p \vert q) \approx e^{-N \KL{p}{q}}\]

as $N \to \infty$.

Recall that the version of Stirling’s approximation that we used has error on the order of $\log N$; consequently the formula above is only accurate up to a multiplicative factor which grows or decays at most polynomially in $N$, and a more careful statement would be

\[\log \P(p \vert q) = -N \KL{p}{q} + O(\log N)\]

However, this doesn’t affect our asymptotic analysis, because the exponential decay in $N$ dominates any polynomial factor.
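
We can check the asymptotic formula, and watch the logarithmic correction, numerically. In the sketch below, which is mine rather than the post’s, `math.lgamma(n + 1)` evaluates $\log n!$ exactly.

```python
from math import lgamma, log

def log_multinomial(counts, q):
    """Exact log-probability of the counts under face weights q."""
    N = sum(counts)
    coefficient = lgamma(N + 1) - sum(lgamma(r + 1) for r in counts)
    return coefficient + sum(r * log(q_i) for r, q_i in zip(counts, q))

p = [0.5, 0.3, 0.2]        # "empirical" distribution
q = [1 / 3, 1 / 3, 1 / 3]  # prior
kl = sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q))

for N in (10, 100, 1000, 10000):
    counts = [round(p_i * N) for p_i in p]
    # The two columns agree up to an error of order log(N).
    print(N, log_multinomial(counts, q), -N * kl)
```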

Entropy optimization

Here is where we stand: we started with a die whose true probability distribution is $p$ and imagined rolling it a very large number $N$ of times. We computed the probability that a sequence with the same statistical properties could have been generated by a prior distribution $q$ and arrived at the asymptotic formula:

\[\P(p \vert q) \approx e^{-N \KL{p}{q}}\]

According to Gibbs’ inequality the relative entropy $\KL{p}{q}$ is nonnegative, and it vanishes if and only if $p = q$. This means that $\P(p \vert q)$ decays exponentially to $0$ unless our prior exactly matches the empirical probabilities.

Let us return to our original problem, where we do not know $p$ exactly but we do know that it belongs to some family $\Theta$ of distributions which does not necessarily contain $q$. Because $\P(p \vert q)$ decays exponentially in $N \KL{p}{q}$, when $N$ is large the overwhelming weight of the probability is concentrated near the distribution which minimizes $\KL{p}{q}$ over $p \in \Theta$. In other words:

Let $\Theta$ be a family of probability distributions defined on a finite sample space which are absolutely continuous with respect to a fixed prior distribution $q$. Then:

\[\mathop{\mathrm{arg\,max}}_{p \in \Theta}\: \P(p \vert q) = \mathop{\mathrm{arg\,min}}_{p \in \Theta}\: \KL{p}{q}\]

In the literature this is usually called the principle of minimum discrimination information.
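
Here is a small numerical illustration of the principle, again my own sketch: for a handful of members of the family $\Theta$ of die distributions with mean $4.5$, the candidate that assigns the largest exact probability to its own typical counts under the uniform prior is precisely the candidate with the smallest relative entropy.

```python
from math import lgamma, log

def log_prob(counts, q):
    """Exact log multinomial probability of the counts under weights q."""
    N = sum(counts)
    coefficient = lgamma(N + 1) - sum(lgamma(r + 1) for r in counts)
    return coefficient + sum(r * log(q_i) for r, q_i in zip(counts, q) if r > 0)

def kl(p, q):
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

u = [1 / 6] * 6
candidates = {                     # a few members of Theta: mean 4.5 in each case
    "A": [0, 0, 0, 0.5, 0.5, 0],
    "B": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],
    "C": [0, 0, 0.25, 0.25, 0.25, 0.25],
}
N = 600
for name, p in candidates.items():
    counts = [round(p_i * N) for p_i in p]
    print(name, log_prob(counts, u), kl(p, u))
# Candidate B has both the largest log-probability and the smallest
# relative entropy; the two rankings agree.
```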

The special case where our prior is the uniform distribution $u$ is common enough that it’s worth handling explicitly:

\[\begin{align*} \KL{p}{u} &= \sum_i \left( p_i \log p_i - p_i \log \frac{1}{k} \right) \\ &= \sum_i p_i \log p_i + \log k \sum_i p_i \\ &= \log k + \sum_i p_i \log p_i \end{align*}\]

Once again, we have encountered a familiar object from information theory:

Given a probability distribution $p$ defined on a finite sample space, the Shannon entropy of $p$ is:

\[H(p) = -\sum_i p_i \log p_i\]

(with the convention $0 \log 0 = 0$).

The sign convention in this definition ensures that $H(p)$ is always nonnegative.

Laplace’s principle of indifference asserts that we should always assume that two states of a system are equally probable unless we have some evidence to the contrary, which in the present context means we should always start with the uniform prior. This leads to the following better-known corollary to the principle of minimum relative entropy:

Let $\Theta$ be a family of probability distributions defined on a finite sample space. Then:

\[\mathop{\mathrm{arg\,max}}_{p \in \Theta}\: \P(p \vert u) = \mathop{\mathrm{arg\,max}}_{p \in \Theta}\: H(p)\]

Finally, in the machine learning literature one often encounters the cross entropy, defined by:

\[H(p,q) = -\sum_i p_i \log q_i = H(p) + \KL{p}{q}\]

If $p$ is fixed and $q$ varies, then minimizing $H(p,q)$ is the same as minimizing $\KL{p}{q}$, and so $H(p,q)$ is often used as a loss function for optimization.
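
Both identities are easy to verify numerically; the following sketch, with function names of my own choosing, checks $\KL{p}{u} = \log k - H(p)$ and $H(p,q) = H(p) + \KL{p}{q}$ on a small example.

```python
from math import log

def entropy(p):
    """Shannon entropy H(p), with the convention 0 * log 0 = 0."""
    return -sum(p_i * log(p_i) for p_i in p if p_i > 0)

def kl(p, q):
    return sum(p_i * log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def cross_entropy(p, q):
    return -sum(p_i * log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
k = len(p)
u = [1 / k] * k

print(kl(p, u), log(k) - entropy(p))               # equal up to rounding
print(cross_entropy(p, q), entropy(p) + kl(p, q))  # equal up to rounding
```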

Back to the die problem

Let us finally return to the problem which kicked off this post. Let $\Theta$ denote the space of all probability distributions on a six-sided die which have mean $4.5$. According to the principle of maximum entropy, the prior we should adopt for this die is:

\[\mathop{\mathrm{arg\,max}}_{p \in \Theta}\: H(p)\]

Writing $p = (p_1, \ldots, p_6)$, we can express this as a constrained optimization problem. We must maximize:

\[H(p) = -\sum_{i=1}^6 p_i \log p_i\]

subject to the constraints:

\[\sum_{i=1}^6 p_i = 1, \quad \sum_{i=1}^6 i p_i = 4.5\]

There are numerous techniques for solving constrained optimization problems of this sort; I will review two of them in the posts linked below. The final result is:

\[p = (0.0544, 0.0788, 0.1142, 0.1654, 0.2398, 0.3475)\]
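
One way to confirm these numbers: a standard Lagrange multiplier computation shows that the maximizer has the exponential form $p_i \propto x^i$ for some $x > 0$, so it suffices to solve a one-dimensional equation for $x$. The sketch below, my own rather than part of the post, does this by bisection.

```python
def mean_roll(x, k=6):
    """Mean of the distribution p_i proportional to x**i on faces 1, ..., k."""
    weights = [x ** i for i in range(1, k + 1)]
    return sum(i * w for i, w in zip(range(1, k + 1), weights)) / sum(weights)

def max_entropy_die(target_mean, k=6, tol=1e-12):
    """Maximum-entropy distribution on a k-sided die with the given mean,
    found by bisecting on x (mean_roll is increasing in x)."""
    lo, hi = 1e-9, 1e9
    while hi - lo > tol * max(lo, 1.0):
        mid = (lo + hi) / 2
        if mean_roll(mid, k) < target_mean:
            lo = mid
        else:
            hi = mid
    x = (lo + hi) / 2
    weights = [x ** i for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = max_entropy_die(4.5)
print([round(p_i, 4) for p_i in p])                    # matches the distribution above
print(sum(i * p_i for i, p_i in zip(range(1, 7), p)))  # approximately 4.5
```

In particular, the answer to the question posed at the top of the post is $p_2 \approx 0.0788$.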