What's in a Mean?
March 11, 2017
Quick: what is the average of 10 and 20? If you blurted out “15” then you are not wrong: we all learned in our math and science classes that the average of a collection of numbers is the ratio of their sum to the number of numbers in the collection; in this case, $\frac{10 + 20}{2} = 15$.
But outside of the classroom we do not compute averages for the sake of testing our skills in arithmetic: we do it because we want to use the answer for something. Usually the goal is to summarize a large amount of data, such as the incomes of all American households or the speeds of motorists in Italy. In this post I will argue that the kind of data being summarized matters a great deal, and I will explain how to construct the right notion of average for a particular data set. I will try to keep the exposition relatively light - just some arithmetic and basic algebra - but you will need to keep the numerical side of your brain engaged.
Averages: when and why?
Let us begin by noting that it is not always possible or valuable to compute the average of a set of numbers. Some data sets simply do not deserve to be averaged: for instance, you would likely look at me quizzically if I asked you to compute the average of 10 apples and 20 miles per hour. This is because it is not at all clear how a single number could summarize both a quantity of apples and a speed measurement; these numbers simply don’t belong in the same data set.
There are other data sets in which all values have the same type but they still should not be averaged. For instance, it may technically be true that the average of my net worth and that of Bill Gates is over $40 billion, but this summary of our net worths is not particularly useful for very many purposes. Indeed, for most purposes I would argue that there is no good way to summarize our net worths. This is more or less the outlier problem, and it is sufficiently familiar to those with some experience with statistics that I will not dwell on it further here.
So let us restrict our attention to data sets which consist only of values of the same type (e.g. only speeds measured in miles per hour) and whose values are not so far off from one another that they become incomparable. Such data sets are entitled to be summarized with an average, but why would we choose to do so? I think there are two fundamental reasons:
- The average can be treated as “normal” behavior for the data, so that observations significantly above or below the average can be flagged for further investigation or action.
- The average can be used to compare like data sets; for instance the average household income in New York can be compared to the average household income in Michigan to draw conclusions about larger social or economic trends in the two states.
Thus the essence of an average is as follows:
An average of a collection of numbers is a value which can be used as a proxy for a "typical" number in the collection.
I will make this more concrete in the next section, but first I would like to call attention to my use of the indefinite article “an” rather than “the”: the central point of this post is that there is more than one way to summarize a data set.
Different notions of “average”
With a clearer goal for computing averages in mind, specific recipes for computing them emerge naturally by carefully considering what a “typical” value in a given data set should mean. This is best illustrated by looking at some examples.
Speed by time intervals
Suppose you drive for one hour at 10 miles per hour (mph) and an additional hour at 20 mph. What is your average speed?
The most sensible interpretation of “average” in this case is the single speed at which you could have travelled for both hours and still covered the same total distance. You covered 10 miles in the first hour and 20 miles in the second hour, so your total distance covered was 30 miles. Meanwhile the trip took 2 hours in total, so your average speed was:
\[\overline{v} = \frac{\text{distance}}{\text{time}} = \frac{10 + 20}{2} = 15 \text{ mph}\]This is called the arithmetic mean of $10$ and $20$, and it is the classical notion of average that we all know and love.
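As a quick sanity check, here is a small Python sketch of this calculation; the helper name `average_speed_by_time` is my own, chosen just for illustration:

```python
def average_speed_by_time(speeds, hours_each=1.0):
    """Average speed when every leg lasts the same amount of time."""
    total_distance = sum(v * hours_each for v in speeds)  # miles
    total_time = hours_each * len(speeds)                  # hours
    return total_distance / total_time

print(average_speed_by_time([10, 20]))  # 15.0, the arithmetic mean
```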
Speed by distances
Suppose you drive for one mile at 10 mph and an additional mile at 20 mph. What is your average speed?
Again, the “average” speed should be the single speed at which you could have covered the same total distance in the same total time. The total distance is straightforward: it is simply 2 miles. The total time is a little more complicated: at 10 mph it takes $\frac{1}{10}$ hours to travel one mile, and at 20 mph it takes $\frac{1}{20}$ hours. So the average speed should be:
\[\overline{v} = \frac{\text{distance}}{\text{time}} = \frac{2}{\frac{1}{10} + \frac{1}{20}} = \frac{40}{3} \text{ mph}\]This is called the harmonic mean of $10$ and $20$. It is not as common as the arithmetic mean, but it arises naturally when the quantities being averaged are best expressed as ratios of other quantities.
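The same total-distance-over-total-time bookkeeping, written as a Python sketch (again with a made-up helper name), recovers the harmonic mean:

```python
def average_speed_by_distance(speeds, miles_each=1.0):
    """Average speed when every leg covers the same distance."""
    total_distance = miles_each * len(speeds)          # miles
    total_time = sum(miles_each / v for v in speeds)   # hours
    return total_distance / total_time

print(average_speed_by_distance([10, 20]))  # 13.33..., i.e. 40/3, the harmonic mean
```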
Growth rates
You take over as CEO of a company. During your first year sales increase by 10% over the previous year, and during your second year sales increase by 20% over your first year. What is the average growth rate of sales over your first two years?
Starting with the total sales $S$ in the year before you started, 10% growth in your first year corresponds to total sales of $1.1 S$, and 20% additional growth in your second year corresponds to total sales of $(1.2 \cdot 1.1) S$. The average growth rate should be a number $r$ which, if repeated each year, would attain the same total sales in your second year. In other words we want:
\[r^2 S = (1.1 \cdot 1.2) S\]which gives:
\[r = \sqrt{1.1 \cdot 1.2} \approx 1.1489\]Thus the average growth rate is about 14.89% per year. The number $\sqrt{1.1 \cdot 1.2}$ is called the geometric mean of 1.1 and 1.2.
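A short Python sketch (the function name is mine, purely illustrative) carries out the same reasoning: multiply the yearly growth factors together and take the appropriate root:

```python
def average_growth_rate(yearly_factors):
    """Constant yearly factor that reproduces the same total growth."""
    total_growth = 1.0
    for factor in yearly_factors:
        total_growth *= factor
    return total_growth ** (1.0 / len(yearly_factors))

print(average_growth_rate([1.1, 1.2]))  # ~1.1489, the geometric mean of 1.1 and 1.2
```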
Averages and symmetry?
The three averages that emerged in the examples in the last section - the arithmetic, harmonic, and geometric means - come up frequently in applications, but they are by no means the end of the story. For instance, if one amends the first example so that the car drives at a speed $v_1$ for two hours and a speed $v_2$ for one hour then the average speed is a weighted arithmetic mean of $v_1$ and $v_2$:
\[\overline{v} = \frac{2 v_1 + v_2}{3}\]It is not hard to invent notions of average that don’t really have names; for instance, if a car drives at a speed of $v_1$ for one hour and then $v_2$ for one mile, the reader can check that the proper notion of average speed is:
\[\overline{v} = \frac{v_1 + 1}{1 + \frac{1}{v_2}}\]I’ve never seen a formula like this in the literature, but it is just as deserving of the name “average” as any of the other quantities we have seen so far. In all cases the objective is to find one or more conserved quantities (e.g. total distance, total time, total sales) and then compute the most symmetric possible way to achieve those conserved quantities. In fact, thinking about averages this way suggests that there is yet more to the story than I have digested so far. For instance, one of my absolute favorite results in mathematics - Noether’s theorem - posits a deep connection between conservation laws and symmetries in physical systems. Does this have any bearing on the concept of an average?
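Before moving on, here is a small Python sketch of my own (the function name and the leg format are assumptions, not standard notation) that makes the conserved-quantities recipe concrete: it averages speed over an arbitrary list of legs, each specified either by its duration in hours or by its length in miles:

```python
def average_speed(legs):
    """Each leg is (speed_mph, hours, miles), with exactly one of hours or miles
    given; the average conserves both total distance and total time."""
    total_distance = 0.0  # miles
    total_time = 0.0      # hours
    for speed, hours, miles in legs:
        if hours is not None:
            total_distance += speed * hours
            total_time += hours
        else:
            total_distance += miles
            total_time += miles / speed
    return total_distance / total_time

# Two hours at v1 = 10 mph and one hour at v2 = 20 mph: (2*10 + 20) / 3
print(average_speed([(10, 2, None), (20, 1, None)]))  # 13.33...
# One hour at 10 mph followed by one mile at 20 mph: (10 + 1) / (1 + 1/20)
print(average_speed([(10, 1, None), (20, None, 1)]))  # 10.476...
```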
Appendix: the $F_1$ score
I will conclude this post by discussing the problem which first got me thinking about averages: the mysterious $F_1$ score in machine learning. The setup is simple: you have an algorithm which attempts to classify observations into one or more categories - for instance, perhaps your algorithm tries to determine whether or not a movie review is positive - and you wish to measure how well your algorithm performs.
There are two metrics which are often used:
- The precision is the probability that a review marked as positive by the algorithm really is positive.
- The recall is the probability that a review which is actually positive will be marked as such by the algorithm.
An algorithm with high precision but low recall is reluctant to make a guess, but when it does guess it is almost always correct; this would be useful if, for instance, you wanted to put a sample of positive reviews on your website for advertising purposes. An algorithm with high recall but low precision tries to capture the full spectrum of positive reviews even if it is a bit error prone; this is more useful for assessing how well a given movie was received overall.
But of course in most cases one hopes for an algorithm whose precision and recall are both high, and so it is desirable to compute an average. What average should we choose?
Observe that precision and recall are both “correctness probabilities”, so their average should be a single probability which achieves the same overall amount of correctness per review. In other words, this probability should be the ratio of “total correctness” to “number of reviews”.
It is clear what “total correctness” should mean: it should be the number $C$ of positive reviews which are correctly marked as such by the algorithm. But “number of reviews” is a bit trickier: if we want to bias our metric in favor of precision then we should pick something close to the number $A$ of reviews marked as positive by the algorithm, while if we are more interested in recall then we should stay close to the number $P$ of actually positive reviews. This line of thinking suggests that the “neutral” choice is to take the arithmetic mean of $A$ and $P$, namely $\frac{1}{2}(A + P)$. With this choice the average correctness is:
\[\frac{C}{\frac{1}{2}(A + P)}\]Let us express this in terms of the precision $p = \frac{C}{A}$ and recall $r = \frac{C}{P}$. We have:
\[\frac{C}{\frac{1}{2}(A + P)} = \frac{2}{\frac{A}{C} + \frac{P}{C}} = \frac{2}{\frac{1}{p} + \frac{1}{r}}\]In other words, the natural notion of average for precision and recall is simply their harmonic mean. In the machine learning literature this is called the $F_1$ score.
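Putting the pieces together in a Python sketch: the counts below are made up, and `f1_score` here is just my own helper (not, say, the scikit-learn function of the same name):

```python
def f1_score(C, A, P):
    """F1 as total correctness C divided by the arithmetic mean of A and P,
    where A = reviews marked positive and P = actually positive reviews."""
    precision = C / A
    recall = C / P
    # Equivalent forms: 2*C / (A + P), and the harmonic mean of precision and recall.
    return 2 / (1 / precision + 1 / recall)

print(f1_score(C=80, A=100, P=120))  # precision 0.8, recall ~0.67, F1 ~0.727
```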
References
My thinking on this issue was heavily influenced by this question on stats.stackexchange.com.