Probably Random

A blog by Paul Siegel

Jensen's inequality and probability

September 01, 2019

This post is all about Jensen’s inequality, an elementary but powerful result about convex functions. It comes up all over the place in mathematics, probably because the concept of convexity is itself so fundamental. I ran into it recently as I was thinking about relative entropy and why it is so well behaved; I think a lot of it comes down to the fact that the relative entropy from one probability distribution to another is always nonnegative - a result known as Gibbs’ inequality - and this follows very easily from Jensen’s inequality.

But that will be for another blog post; in this one I will carefully formulate and prove Jensen’s inequality and give a few fun applications in basic statistics.

Convex functions

Before even stating Jensen’s inequality, we need to go over the basics of the theory of convex functions. A function is said to be convex if the set of points in the plane which lie above its graph forms a convex set. More formally:

Let $f \colon I \to \mathbb{R}$ be a function defined on an interval $I \subseteq \mathbb{R}$. $f$ is convex if $$f(ta + (1-t)b) \le tf(a) + (1-t)f(b)$$ for all $a, b \in I$ and all $t \in [0,1]$.

The left-hand side of the inequality represents the value of $f$ at a point between $a$ and $b$, while the right-hand side represents the value at that point of the line joining $(a, f(a))$ and $(b, f(b))$; thus the inequality says that, between any two points of the graph, the graph lies below the chord joining them.
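Here is what the defining inequality looks like as a direct numerical check (a sketch of my own, not part of the original argument), using $f(x) = x^2$ as a test function. A numerical check is of course not a proof; a rigorous criterion for verifying convexity appears at the end of this section.

```python
import numpy as np

# Numerically test the convexity inequality f(ta + (1-t)b) <= t f(a) + (1-t) f(b)
# for the test function f(x) = x**2 at randomly chosen a, b, t.
f = lambda x: x ** 2

rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = rng.uniform(-10, 10, size=2)
    t = rng.uniform(0, 1)
    chord = t * f(a) + (1 - t) * f(b)             # value of the chord at ta + (1-t)b
    assert f(t * a + (1 - t) * b) <= chord + 1e-12
```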

Just staring at the definition, it is not clear how one would go about constructing any examples. Those who remember their calculus may think of the second derivative: if $f^{(2)}(x) > 0$ for all $x$ then it means that $f$ “curves up” in some sense, and this feels like it ought to imply that $f$ is convex. Not all convex functions are twice (or even once!) differentiable, but this intuition is on the right track. We can formalize it by introducing a generalization of the concept of a tangent line, which proceeds by means of the following lemma:

Let $f \colon I \to \mathbb{R}$ be a convex function defined on an interval $I \subseteq \mathbb{R}$. For any $a, x_0, b \in I$ such that $a < x_0 < b$, we have:

$$\frac{f(x_0) - f(a)}{x_0 - a} \le \frac{f(b) - f(a)}{b - a} \le \frac{f(b) - f(x_0)}{b - x_0}$$

Set $t = \frac{b - x_0}{b - a}$, so that $1 - t = \frac{x_0 - a}{b - a}$; the reader may check that for this choice of $t$ we have $x_0 = ta + (1-t)b$. By convexity, we have:

$$f(x_0) \le tf(a) + (1-t)f(b) = \frac{b - x_0}{b - a}f(a) + \frac{x_0 - a}{b - a}f(b) = x_0\frac{f(b) - f(a)}{b - a} + \frac{bf(a) - af(b)}{b - a}$$

Subtract $f(a)$ from both sides to obtain:

$$\begin{aligned} f(x_0) - f(a) &\le x_0\frac{f(b) - f(a)}{b - a} + \frac{bf(a) - af(b)}{b - a} - f(a) \\ &= x_0\frac{f(b) - f(a)}{b - a} + \frac{bf(a) - af(b)}{b - a} - \frac{bf(a) - af(a)}{b - a} \\ &= x_0\frac{f(b) - f(a)}{b - a} - a\frac{f(b) - f(a)}{b - a} \\ &= (x_0 - a)\frac{f(b) - f(a)}{b - a} \end{aligned}$$

Dividing both sides through by $(x_0 - a)$ gives the first half of the claimed inequality. To get the second half subtract $f(b)$ instead of $f(a)$, multiply through by $-1$, and proceed as above.
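As a quick illustration of the lemma (my own, not part of the proof), the three-slope inequality can be spot-checked numerically for a sample convex function such as $f(x) = e^x$:

```python
import numpy as np

# Spot-check the three-slope inequality for the convex function f(x) = exp(x).
f = np.exp
rng = np.random.default_rng(1)

for _ in range(1000):
    a, x0, b = np.sort(rng.uniform(-3, 3, size=3))
    left = (f(x0) - f(a)) / (x0 - a)     # slope from a to x0
    mid = (f(b) - f(a)) / (b - a)        # slope from a to b
    right = (f(b) - f(x0)) / (b - x0)    # slope from x0 to b
    assert left <= mid + 1e-9 and mid <= right + 1e-9
```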

Now, define

$$m_- = \sup\left\{\frac{f(x_0) - f(a)}{x_0 - a} : a \in I,\ a < x_0\right\}, \qquad m_+ = \inf\left\{\frac{f(b) - f(x_0)}{b - x_0} : b \in I,\ b > x_0\right\}$$

Both $m_-$ and $m_+$ are finite by the previous lemma, and the reader may check that $m_- \le m_+$.

Show that $m_- = m_+$ if and only if $f$ is differentiable at $x_0$, and in that case $m_- = m_+ = f'(x_0)$.

Let $f(x) = |x|$ and let $x_0 = 0$. What are $m_-$ and $m_+$?
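For this exercise, a short computation (a sketch of mine, not part of the original post) suggests the answer by approximating the one-sided difference quotients numerically:

```python
import numpy as np

# Approximate m_- and m_+ for f(x) = |x| at x0 = 0 via one-sided difference quotients.
x0 = 0.0
a = -np.linspace(0.01, 5, 500)   # points to the left of x0
b = np.linspace(0.01, 5, 500)    # points to the right of x0

m_minus = np.max((np.abs(x0) - np.abs(a)) / (x0 - a))  # sup of slopes from the left
m_plus = np.min((np.abs(b) - np.abs(x0)) / (b - x0))   # inf of slopes from the right
print(m_minus, m_plus)  # -1.0 1.0
```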

The main significance of $m_-$ and $m_+$ for our purposes is that they can be used to construct lines which behave a little bit like tangent lines for $f$, in the sense of the following lemma:

Let $f \colon I \to \mathbb{R}$, $x_0$, $m_-$, and $m_+$ be defined as above. For any $m \in [m_-, m_+]$, define $\ell(x) = f(x_0) + m(x - x_0)$. Then $\ell(x) \le f(x)$ for all $x \in I$.

To begin, suppose $x < x_0$. By definition, we have:

$$m \ge m_- \ge \frac{f(x_0) - f(x)}{x_0 - x} = \frac{f(x) - f(x_0)}{x - x_0}$$

So since $(x - x_0)$ is negative we get:

$$\ell(x) = f(x_0) + m(x - x_0) \le f(x_0) + \frac{f(x) - f(x_0)}{x - x_0}(x - x_0) = f(x)$$

The case $x > x_0$ follows similarly, using the inequality $m \le m_+$ instead.

Because of this lemma, numbers in the interval $[m_-, m_+]$ are often called subderivatives of $f$ at $x_0$. We’re more interested in the lines themselves, and we’ll give them a special name:

Let $f \colon I \to \mathbb{R}$ be a function defined on an interval $I$. A supporting line for $f$ at $x_0 \in I$ is a linear function $\ell(x)$ with the property that $\ell(x_0) = f(x_0)$ and $\ell(x) \le f(x)$ for all $x \in I$.

Show that $f(x) = x^3$ has no supporting line at $x_0 = 0$.

The existence of supporting lines at every point characterizes convexity, as per the next proposition:

Let $f \colon I \to \mathbb{R}$ be a function defined on an interval $I$. Then $f$ is convex if and only if there is a supporting line for $f$ at every point $x_0 \in I$.

If $f$ is convex then the previous lemma implies that $\ell(x) = f(x_0) + m(x - x_0)$ is a supporting line to $f$ at $x_0$ for any subderivative $m$.

So assume that $f$ has a supporting line at every point; we aim to show that $f$ is convex. Following the definition of convexity, fix $a, b \in I$ and let $x_t = ta + (1-t)b$ for arbitrary $t \in [0,1]$. Let $\ell(x)$ be a supporting line for $f$ at $x_t$; we have:

$$f(ta + (1-t)b) = f(x_t) = \ell(x_t) = t\ell(a) + (1-t)\ell(b) \le tf(a) + (1-t)f(b)$$

where the last equality holds by linearity of $\ell$ and the final inequality uses the supporting line property $\ell \le f$.

Since a, b, and t were arbitrary, f is convex.

The condition on supporting lines is often easier to use in examples than the definition of convexity because it can be checked pointwise. To conclude our discussion of convex functions, let us use calculus to show that a lot of familiar functions like $f(x) = e^x$ or $f(x) = x^2$ are convex.

Let $f \colon I \to \mathbb{R}$ be a twice continuously differentiable function on an interval $I$. If $f^{(2)}(x) \ge 0$ for all $x \in I$ then every tangent line to $f$ is a supporting line and in particular $f$ is convex.

Our strategy is to use the Lagrange form of the error term in Taylor’s theorem. This asserts that for any base point $x_0 \in I$ and any other point $x \in I$ there is some $c$ between $x_0$ and $x$ such that:

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f^{(2)}(c)}{2!}(x - x_0)^2$$

The tangent line $\ell(x) = f(x_0) + f'(x_0)(x - x_0)$ clearly satisfies $\ell(x_0) = f(x_0)$, and we have $\ell(x) \le f(x)$ since $f^{(2)}(c) \ge 0$.
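As a sanity check on this proposition (a small script of my own), one can verify numerically that tangent lines to $f(x) = e^x$, whose second derivative is everywhere positive, never rise above the graph:

```python
import numpy as np

# Tangent lines to exp should be supporting lines, since exp'' = exp >= 0.
f = np.exp   # note that f' = f for this particular function
x = np.linspace(-5, 5, 2001)

for x0 in np.linspace(-4, 4, 17):
    tangent = f(x0) + f(x0) * (x - x0)    # ell(x) = f(x0) + f'(x0)(x - x0)
    assert np.all(tangent <= f(x) + 1e-9)
```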

Jensen’s inequality

We are now ready to formulate and prove Jensen’s inequality. It is an assertion about how convex functions interact with expected values of random variables, and we will formulate it on an abstract measure space $(\Omega, \Sigma, P)$ where $\Omega$ is a set, $\Sigma$ is a $\sigma$-algebra of subsets of $\Omega$, and $P$ is a probability measure on $(\Omega, \Sigma)$. Readers uncomfortable with this formalism are welcome to restrict to the case where $\Omega = \{\omega_1, \ldots, \omega_n\}$ is a finite set for which we assign probabilities $P(\omega_i) = p_i$ where the $p_i$’s are nonnegative and sum to $1$. This is enough for the applications in this post!

Given a random variable $X \colon \Omega \to \mathbb{R}$, recall that the expectation of $X$ is by definition:

$$E(X) = \int_\Omega X \, dP$$

In the case where $\Omega$ is finite, this just reduces to $E(X) = \sum_{i=1}^n p_i X(\omega_i)$. We will only really use three properties of the expectation operator:

  1. Linearity: $E(aX + bY) = aE(X) + bE(Y)$ for constants $a, b$, and $E(c) = c$ for any constant $c$.
  2. Monotonicity: if $X \le Y$ then $E(X) \le E(Y)$.
  3. If $X \ge 0$ and $E(X) = 0$ then $X = 0$ almost everywhere.

Finally, given a random variable $X \colon \Omega \to \mathbb{R}$ and a function $f \colon \mathbb{R} \to \mathbb{R}$, recall that $f(X)$ is by definition the random variable $f \circ X \colon \Omega \to \mathbb{R}$. This still makes sense even if $f$ is defined only on the range of $X$.

Let $X$ be a random variable defined on a probability space $(\Omega, \Sigma, P)$ and let $f \colon I \to \mathbb{R}$ be a convex function defined on an interval $I \subseteq \mathbb{R}$ containing the range of $X$. Then:

$$f(E(X)) \le E(f(X))$$

with equality if and only if either $X$ is constant or $f$ coincides with a line on an interval which contains the range of $X$ (up to sets of measure zero).

Since $f$ is convex it admits a supporting line $\ell$ at the point $x_0 = E(X)$. This means $\ell(E(X)) = f(E(X))$ and $\ell(x) \le f(x)$ for all $x \in I$; in particular, $\ell(X) \le f(X)$. By monotonicity and linearity of expectation, we get:

$$E(f(X)) \ge E(\ell(X)) = \ell(E(X)) = f(E(X))$$

To handle the equality case, observe that:

$$E(f(X)) - E(\ell(X)) = E((f - \ell)(X)) = \int_\Omega (f - \ell) \circ X \, dP$$

But the integrand is nonnegative since $\ell(x) \le f(x)$, so the equality case is equivalent to $(f - \ell) \circ X = 0$ almost everywhere, or in other words $f = \ell$ almost everywhere on the range of $X$.

Let $R$ denote the set of all points in the range of $X$ at which $f = \ell$. If $R$ contains only one point then $X$ is constant almost everywhere; otherwise let $a < b$ be two distinct points in $R$. Any point between $a$ and $b$ can be expressed as $x_t = ta + (1-t)b$ for some $t \in [0,1]$, and we have:

$$\ell(x_t) \le f(x_t) \le tf(a) + (1-t)f(b) = t\ell(a) + (1-t)\ell(b) = \ell(ta + (1-t)b) = \ell(x_t)$$

Plainly, this forces $f(x_t) = \ell(x_t)$. It follows that $f = \ell$ on the interval $[a, b]$, and by taking the union over all such pairs we see that $f = \ell$ on the interval $(\inf R, \sup R)$, which contains the range of $X$ up to a set of measure zero.
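Here is a quick numerical illustration of the theorem on a finite probability space (my own sketch; the weights and the choice of the convex function $f(x) = e^x$ are arbitrary):

```python
import numpy as np

# Illustrate f(E(X)) <= E(f(X)) for the convex function f = exp on a finite probability space.
rng = np.random.default_rng(2)

x = rng.normal(size=10)             # values of the random variable X
p = rng.random(10)
p /= p.sum()                        # probabilities p_i, nonnegative and summing to 1

E_X = np.dot(p, x)                  # E(X)
E_fX = np.dot(p, np.exp(x))         # E(f(X))
assert np.exp(E_X) <= E_fX + 1e-12  # Jensen: f(E(X)) <= E(f(X))
```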

The proof of this result was relatively straightforward, but it depended crucially on all of the setup work that we did involving convex functions and supporting lines. We’ll see this pay off in the next section.

Applications

We will give two applications here: a classical inequality involving different sorts of averages and an inequality involving the mean, median, and standard deviation. I don’t know a significantly easier way to deduce either inequality.

Arithmetic, geometric, and harmonic means

Let $x_1, \ldots, x_n$ be a list of positive numbers. There are a number of different notions of “average” for such a list - I wrote an elementary blog post about the subject here. These include:

$$A(x_1, \ldots, x_n) = \frac{x_1 + \cdots + x_n}{n}$$
$$G(x_1, \ldots, x_n) = (x_1 \cdots x_n)^{1/n}$$
$$H(x_1, \ldots, x_n) = \frac{n}{\frac{1}{x_1} + \cdots + \frac{1}{x_n}}$$
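For concreteness, here is a direct transcription of these formulas into Python (the function names are mine):

```python
import numpy as np

# The three means, written exactly as in the formulas above.
def arithmetic_mean(x):
    return np.sum(x) / len(x)

def geometric_mean(x):
    return np.prod(x) ** (1 / len(x))

def harmonic_mean(x):
    return len(x) / np.sum(1 / np.asarray(x))

x = [1.0, 2.0, 4.0]
print(arithmetic_mean(x), geometric_mean(x), harmonic_mean(x))
# roughly 2.333, 2.0, 1.714 -- already consistent with H <= G <= A
```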

Our main result is that the arithmetic mean is always at least as large as the geometric mean, which in turn is always at least as large as the harmonic mean. This is difficult to prove! The approach using Jensen’s inequality is by far the simplest that I know.

The first step is also perhaps the cleverest: to introduce probabilistic language. Let $\Omega = \{\omega_1, \ldots, \omega_n\}$ be a set with $n$ elements, and introduce the uniform probability measure determined by $P(\omega_i) = \frac{1}{n}$ for each $i$. Define a random variable $X$ on $\Omega$ by $X(\omega_i) = x_i$, and for short write $A(X)$ instead of $A(x_1, \ldots, x_n)$ and similarly for $G$ and $H$.

The proof of the inequality will use several identities which follow immediately from the definitions (and are left to the reader):

  1. $A(X) = E(X)$
  2. $G(X) = e^{E(\log X)}$
  3. $H(X) = \frac{1}{E(1/X)}$
  4. $G(X) = \frac{1}{G(1/X)}$
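These identities are easy to spot-check numerically; the following sketch (mine, with the means written out inline) verifies them for a random list of positive numbers:

```python
import numpy as np

# Spot-check identities 1-4 for a random list of positive numbers under the uniform measure.
rng = np.random.default_rng(3)
x = rng.uniform(0.5, 5.0, size=20)

E = np.mean                              # expectation with respect to the uniform measure
A = np.sum(x) / len(x)                   # arithmetic mean
G = np.prod(x) ** (1 / len(x))           # geometric mean
H = len(x) / np.sum(1 / x)               # harmonic mean

assert np.isclose(A, E(x))                                   # 1. A(X) = E(X)
assert np.isclose(G, np.exp(E(np.log(x))))                   # 2. G(X) = e^{E(log X)}
assert np.isclose(H, 1 / E(1 / x))                           # 3. H(X) = 1/E(1/X)
assert np.isclose(G, 1 / (np.prod(1 / x) ** (1 / len(x))))   # 4. G(X) = 1/G(1/X)
```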

For any finite set $x_1, \ldots, x_n$ of positive numbers:

$$H(x_1, \ldots, x_n) \le G(x_1, \ldots, x_n) \le A(x_1, \ldots, x_n)$$

with equality if and only if $x_1 = \cdots = x_n$.

We’ll start with the inequality $G(X) \le A(X)$. Consider the function $f(x) = -\log x$ defined on the interval $(0, \infty)$ (which contains the range of $X$ since the $x_i$’s are assumed to be positive). We have $f^{(2)}(x) = \frac{1}{x^2} > 0$ for all $x$, so by the final result in our discussion of convex functions we know that $f$ is convex. Thus we may apply Jensen’s inequality to get:

$$-\log E(X) \le E(-\log X)$$

We can remove the minus signs at the cost of reversing the inequality:

$$\log E(X) \ge E(\log X)$$

Exponentiating both sides preserves the inequality since $e^x$ is an increasing function of $x$, so we get:

$$E(X) \ge e^{E(\log X)}$$

Using identities 1 and 2 above, we recognize the left-hand side as A(X) and the right-hand side as G(X), as desired.

To handle the equality case, recall that Jensen’s inequality is an equality if and only if $X$ is constant or $f$ coincides with a linear function on an interval containing the range of $X$. But the logarithm function does not restrict to a linear function on any interval, so we get equality between the arithmetic and geometric means if and only if all of the $x_i$’s are equal.

The inequality involving the harmonic and geometric means follows from the arithmetic / geometric mean inequality applied to the random variable $1/X$ (together with identities 1, 3, and 4):

$$H(X) = \frac{1}{E(1/X)} = \frac{1}{A(1/X)} \le \frac{1}{G(1/X)} = G(X)$$

The analysis of the equality case follows from the arithmetic / geometric mean equality analysis.

In addition to being simple and elegant, this argument generalizes with basically no extra effort to the case of weighted averages and to continuous probability distributions thanks to the generality in which we formulated Jensen’s inequality.

How far is the median from the mean?

Here is a very simple question coming from statistics: given a finite sequence $x_1 \le \cdots \le x_n$ of numbers, what is the furthest the median can be from the (arithmetic) mean? Recall that the median is by definition the value $x_i$ occurring at index $i = \frac{n+1}{2}$ if $n$ is odd, and if $n$ is even it is the average of the values $x_i$ and $x_{i+1}$ where $i = \frac{n}{2}$. (In fact, any value between $x_i$ and $x_{i+1}$ would do for the present purposes.)
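The piecewise definition of the median translates directly into code; here is a short version of mine (with 0-based indices, so they are shifted by one relative to the text):

```python
import numpy as np

# The median as defined above: the middle value for odd n, the average
# of the two middle values for even n.  Indices here are 0-based.
def median(xs):
    xs = np.sort(xs)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n - 1) // 2]          # 1-based index (n+1)/2
    i = n // 2 - 1                       # 1-based index n/2
    return (xs[i] + xs[i + 1]) / 2

print(median([3, 1, 2]), median([4, 1, 3, 2]))  # 2 2.5
```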

This question has a very simple answer: the median is always within one standard deviation of the mean. This is a very elementary statement, but I don’t think it is easy to prove - one way or the other some mechanism for estimating expectations of random variables is required.

To begin, we will need a clever variational characterization of the median, which is interesting in its own right. As in the previous section, equip the finite set $\Omega = \{\omega_1, \ldots, \omega_n\}$ with the uniform probability measure, and let $X$ be the random variable on this probability space defined by $X(\omega_i) = x_i$.

The minimum value of the absolute deviation function $f(t) = E(|X - t|)$ occurs at the median of $X$.

Assume the $x_i$’s are arranged in increasing order, so that $x_1 \le \cdots \le x_n$. For any $t < x_1$ we have $|x_i - t| > |x_i - x_1|$ for all $i$, so $f$ cannot be minimized at a value of $t$ smaller than $x_1$. Similarly, $f$ cannot be minimized at a value of $t$ larger than $x_n$. Thus we can restrict $f$ to the interval $[x_1, x_n]$; $f$ is clearly continuous, so we know that $f$ attains its minimum value somewhere by compactness.

For any $t \in [x_1, x_n]$ we have:

$$|x_1 - t| + |x_n - t| = t - x_1 + x_n - t = x_n - x_1$$

Thus at such t we have:

$$f(t) = E(|X - t|) = \frac{1}{n}\sum_{i=1}^n |x_i - t| = \frac{x_n - x_1}{n} + \frac{1}{n}\sum_{i=2}^{n-1} |x_i - t|$$

It follows that $f$ attains its minimum value at the same point as the absolute deviation function for the points $x_2, \ldots, x_{n-1}$. Applying this argument inductively, we reduce either to the one point set $\{x_{(n+1)/2}\}$ if $n$ is odd or the two point set $\{x_{n/2}, x_{n/2+1}\}$ if $n$ is even.

In the first case, the absolute deviation function is $t \mapsto |x_{(n+1)/2} - t|$, which is minimized precisely at $t = x_{(n+1)/2}$. In the second case, the absolute deviation function is $t \mapsto |x_{n/2} - t| + |x_{n/2+1} - t|$, and as above this is minimized when $t \in [x_{n/2}, x_{n/2+1}]$, where it takes the constant value $x_{n/2+1} - x_{n/2}$. In either case, the minimum value occurs at the median.
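The variational characterization is easy to see numerically: scanning $E(|X - t|)$ over a grid of $t$ values (a sketch of mine) puts the minimizer at the median:

```python
import numpy as np

# Scan t -> E(|X - t|) over a grid and compare the minimizer to the median.
rng = np.random.default_rng(4)
x = rng.normal(size=11)                  # an odd number of points, so the median is unique

ts = np.linspace(x.min(), x.max(), 10001)
abs_dev = np.abs(x[None, :] - ts[:, None]).mean(axis=1)   # E(|X - t|) for each t

print(ts[np.argmin(abs_dev)], np.median(x))   # the two values agree up to grid resolution
```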

We’re now ready to prove the main result.

Let $\mu$ denote the arithmetic mean of the numbers $x_1, \ldots, x_n$, let $m$ denote their median, and let $\sigma = \sqrt{E((X - \mu)^2)}$ denote their standard deviation. Then:

$$|\mu - m| \le \sigma$$

with equality if and only if the $x_i$’s are all equal.

Let $X$ denote the random variable whose values are $x_1, \ldots, x_n$ as above. Clearly $\mu - m = E(X) - m = E(X - m)$. The function $x \mapsto |x|$ is convex ($\ell(x) = x$ is a supporting line at all $x_0 \ge 0$, and $\ell(x) = -x$ is a supporting line at all $x_0 \le 0$), so by Jensen’s inequality we have:

$$|\mu - m| = |E(X - m)| \le E(|X - m|)$$

According to the previous proposition $E(|X - m|)$ is the minimum value of the function $t \mapsto E(|X - t|)$, so plugging in $t = \mu$ we get:

$$E(|X - m|) \le E(|X - \mu|)$$

The function $x \mapsto x^2$ is convex (its second derivative is $2$ at all points), so by Jensen’s inequality again we have:

$$E(|X - \mu|)^2 \le E((X - \mu)^2)$$

Taking square roots and putting it all together, we get:

$$|\mu - m| \le E(|X - \mu|) \le \sqrt{E((X - \mu)^2)} = \sigma$$

Note that the equality case for the second application of Jensen’s inequality requires that $|X - \mu|$ is constant since $x^2$ is not linear on any interval, and this random variable can only be constant if $X$ is constant.
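Finally, the inequality itself is easy to test empirically; the following check (mine) uses skewed samples, where the mean and median genuinely differ:

```python
import numpy as np

# Check |mu - m| <= sigma on random skewed samples.  Note that np.std with its
# default ddof=0 computes exactly sqrt(E((X - mu)^2)), matching the text.
rng = np.random.default_rng(5)
for _ in range(1000):
    x = rng.exponential(size=rng.integers(2, 50))
    mu, m, sigma = x.mean(), np.median(x), x.std()
    assert abs(mu - m) <= sigma + 1e-12
```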

Both applications of Jensen’s inequality in this argument were a little cheap; the first one is really just the triangle inequality, and the second one is the Cauchy-Schwarz inequality. But expressing the argument in this way buys a lot of generality for free; for instance if $x_1, \ldots, x_n$ are points in $\mathbb{R}^d$ instead of $\mathbb{R}$ then one can define the spatial median to be any point which minimizes the absolute deviation function $t \mapsto E(|X - t|)$, and the argument above gives an inequality between the mean, spatial median, and standard deviation almost verbatim. The argument also has generalizations to weighted and continuous probability distributions, all of which come for free without having to fuss about what the standard inequalities should look like.