Probably Random

A blog by Paul Siegel

Annotator Agreement in Machine Learning Experiments

February 12, 2018

All of the pieces of your sandwich detection classifier are falling into place. You collected 20000 images from Instagram, and by parsing hashtags you separated them into two classes: those that contain a sandwich and those that don’t. You trained a convolutional neural net to distinguish between the two classes, and the initial results are looking good.

But a lot is riding on the performance of your sandwich classifier, and you have to conduct an experiment to rigorously measure its performance. So you hire two annotators to look through a sample of 1000 images and indicate whether or not each one contains a sandwich. Let us denote the annotators by A and B, and let us use the numbers 0 and 1 to denote the labels “does not contain a sandwich” and “does contain a sandwich”, respectively. Here is a possible table of results from the annotation experiment:

|       | A = 0 | A = 1 | Total |
|-------|-------|-------|-------|
| B = 0 | 400   | 90    | 490   |
| B = 1 | 60    | 450   | 510   |
| Total | 460   | 540   | 1000  |


In other words, there were 400 images which both annotators agreed contained no sandwich, 90 images which annotator A thought contained a sandwich but annotator B did not, and so on.

One immediate observation we can make using this table is that the two annotators agreed on the presence or absence of a sandwich in 850 of the 1000 images, or 85% of the time. We can view this as a measure of the quality of the problem and/or the experiment: if annotator agreement is low then it suggests that the annotation task is too subjective for statistical modelling to be effective.
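
To make the arithmetic concrete, here is a minimal Python sketch (the variable names are my own) that computes the raw agreement rate from the contingency table above.

```python
# Contingency table from above: rows are annotator B's labels, columns are
# annotator A's labels (0 = "no sandwich", 1 = "sandwich").
table = [
    [400, 90],   # B = 0
    [60, 450],   # B = 1
]

total = sum(sum(row) for row in table)                    # 1000
agreements = sum(table[i][i] for i in range(len(table)))  # 400 + 450 = 850

print(agreements / total)  # 0.85
```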

Unfortunately, the raw rate of annotator agreement is not a suitable measure of the quality of an experiment, and we’ll have to think a bit more carefully.

Disagreeing with Agreement

To highlight the subtleties involved, consider the extreme case where 99% of the images in the annotation set contain a sandwich. Suppose that the annotators are lazy, and instead of carefully searching for sandwiches in each image they assign the label “has a sandwich” to 99% of the images chosen at random and “doesn’t have a sandwich” to the remaining 1%. Then the rate of agreement will invariably be extremely high - they can’t possibly disagree on more than 2% of the images - but the experiment could not possibly be useful since the annotators didn’t even look at the data.

But upon further reflection, considering the overall rate of agreement in isolation is a straightforward violation of statistical fundamentals. The overall rate is the sum of the rate of “sandwich agreement” (both annotators agree that the image contains a sandwich) and “non-sandwich agreement” (both annotators agree that the image does not contain a sandwich). It is natural - perhaps even expected - that these rates will vary somewhat independently; for instance, if the two annotators disagree on whether or not a hot dog is a sandwich then it reduces the number of sandwich agreements while the number of non-sandwich agreements remains the same. So combining the two rates into one overall rate, no matter how carefully it is done, risks losing information.

Of course these problems get more complicated and more severe when one considers experiments involving more than two categories or more than two annotators. The solution, in my opinion, is simply to report the rates of agreement for all categories individually. I will explain how this can be done in good time, but I must first note that this is not actually standard practice. The standard approach is to tweak the overall rate of agreement by computing a so-called $\kappa$ score and reporting that score instead; I believe that $\kappa$ has a lot of problems, but I would be remiss if I did not give it a fair hearing.

The Kappa Konundrum

The $\kappa$ statistic, first introduced in the form of the $\pi$ statistic by Scott in 1955, handles the thought experiment at the beginning of the previous section by comparing observed annotator behavior to the behavior of “random” annotators. The statistic has the form

\[\kappa = \frac{r - e}{1 - e}\]

where $r$ is the rate of agreement (the total number of agreements divided by the total number of annotated samples) and $e$ is the expected agreement. There are various competing notions of what expected agreement should mean - I will summarize some of the common definitions in a forthcoming post - but roughly speaking $e$ represents the rate of agreement we would expect to see if the annotators were guessing independently at random. So the numerator $r - e$ in the formula for $\kappa$ represents how much more (or less) agreement is observed compared to random annotators, and the denominator $1 - e$ normalizes the expression so that $-1 \leq \kappa \leq 1$.
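
To make the formula concrete, here is a sketch of one common choice of $e$ (the pooled-marginal definition used by Scott's $\pi$; other definitions exist, as noted above), applied to the contingency table from the beginning of the post.

```python
def scotts_pi(table):
    """Scott's pi for two annotators, given a square contingency table
    (rows: annotator B's labels, columns: annotator A's labels)."""
    n = sum(sum(row) for row in table)
    r = sum(table[j][j] for j in range(len(table))) / n  # observed agreement

    # Expected agreement for annotators who guess independently according to
    # the pooled label distribution.  (Cohen's kappa would instead use each
    # annotator's own marginal distribution.)
    e = 0.0
    for j in range(len(table)):
        pooled = (sum(table[j]) + sum(row[j] for row in table)) / (2 * n)
        e += pooled ** 2

    return (r - e) / (1 - e)

print(scotts_pi([[400, 90], [60, 450]]))  # roughly 0.70
```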

This does help control for the distortions observed in the thought experiment from the last section, but it is plagued by a variety of theoretical and practical concerns which have been reported in the literature. Details will be provided in a forthcoming post, but here is a summary:

In my brief literature review I was unable to find any serious responses to these objections, so my best guess is that $\kappa$ is a historical relic whose prevalence is explained by tradition rather than principles. It is also possible that it makes more sense in disciplines like psychology or medicine than in machine learning.

Agreement by Category

Instead of trying to correct the overall agreement rate with ill-conceived annotator models, let us compute agreement rates for each category individually and, if necessary, summarize the experiment by aggregating these scores. This approach, advocated by Uebersax, has several advantages over the various $\kappa$ scores:

Also note that considering the categories separately still resolves the skewed category problem which motivated the development of the $\kappa$ score: if two annotators independently randomly apply the label “contains a sandwich” to 99% of the images then it is very unlikely that there will be much overlap in the 1% of the images that each annotator labels “does not contain a sandwich”. So while the agreement score for the dominant category will be very high, the experimenters will be tipped off by the fact that the agreement score for the minority category is extremely low.

The main advantage that $\kappa$ has over the category-by-category agreement scores is that it summarizes the reliability of the experiment with a single number rather than one number for each category. But if reducing the evaluation to a single value is a hard requirement then one can simply report one of the usual summary statistics for a collection of numbers: mean, median, etc. I recommend reporting the lowest rate of agreement as the overall summary statistic: if this number is small then at least part of the annotation task was subjective or confusing, and if it is high (close to 1) then annotator agreement was high across the board.
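
In code this summary is a one-liner; here is a sketch that assumes a hypothetical `rates` dictionary of per-category agreement rates (the values shown are the ones computed in the Calculations section below).

```python
# Per-category agreement rates, keyed by category label.
rates = {"no sandwich": 0.73, "sandwich": 0.75}

# Report the worst category as the single summary number: a low value means
# at least part of the annotation task was subjective or confusing.
print(min(rates.values()))  # 0.73
```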

Calculations

Let us work through how to actually calculate the rate of annotator agreement for a given category. The algorithm described in this section is implemented in a companion notebook which the reader may wish to consult, but we shall also walk through a detailed example by hand in this post.

We will assume that an unspecified number of annotators are tasked with applying one or more of $m$ possible categories to $n$ samples. We do not assume that the same annotators, or even the same number of annotators, are exposed to every sample.

Following Uebersax, we will define the rate of agreement for a category to be the number of agreements for that category divided by the number of “potential agreements”. Here an agreement consists of a pair of distinct annotators (unordered) together with a sample to which both annotators applied the given category. A potential agreement consists of a pair of distinct annotators together with a sample to which at least one of the annotators applied the given category.

(These definitions do not agree precisely with those of Uebersax, who seems to have counted permutations rather than combinations of pairs of distinct annotators. For our purposes, if exactly two annotators apply a given category to a given sample, it will count as one agreement.)

Label the categories with a number $j \in \{1, \ldots, m\}$ and label the samples with a number $k \in \{1, \ldots, n\}$. Let $c_{jk}$ denote the number of annotators who apply the category $j$ to the sample $k$, and let $c_k$ denote the total number of times the sample $k$ is annotated (i.e. the sum of the $c_{jk}$’s over all $j$).

Given this notation, the number of pairs of annotators who agreed that the category $j$ should be applied to the sample $k$ is given by the binomial coefficient:

\[\binom{c_{jk}}{2} = \frac{c_{jk}(c_{jk} - 1)}{2}\]

To compute the number of pairs of annotators at least one of whom applied the category $j$ to the sample $k$, subtract the number of pairs neither of whom applied $j$ to $k$ from the total number of pairs:

\[\begin{align*} \binom{c_k}{2} - \binom{c_k-c_{jk}}{2} &= \binom{c_k}{2} - \frac{(c_k-c_{jk})(c_k-c_{jk}-1)}{2} \\ &= \binom{c_k}{2} - \left(\frac{c_k(c_k-1)}{2} + \frac{c_{jk}(c_{jk} + 1)}{2} - c_k c_{jk}\right) \\ &= c_k c_{jk} - \binom{c_{jk}+1}{2} \end{align*}\]
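
As a quick sanity check on the algebra, the identity can be verified symbolically; here is a sketch using sympy, with “choose 2” written out as a polynomial so that it applies to symbolic arguments.

```python
import sympy as sp

c_k, c_jk = sp.symbols("c_k c_jk")

def choose2(x):
    # "x choose 2" written out as a polynomial in x
    return x * (x - 1) / 2

lhs = choose2(c_k) - choose2(c_k - c_jk)
rhs = c_k * c_jk - choose2(c_jk + 1)

print(sp.simplify(lhs - rhs))  # 0
```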

Summing these formulas over $k$ gives the total number of agreements $A_j$ and the total number of “potential agreements” $P_j$ on the category $j$ across all samples:

\[A_j = \sum_k \binom{c_{jk}}{2}, \quad P_j = \sum_k \left( c_{jk} c_k - \binom{c_{jk}+1}{2} \right)\]

Finally, the rate of agreement for the category $j$ across all samples is:

\[R_j = \frac{A_j}{P_j}\]

This score can be reported as a summary statistic for the agreement of the annotators on category $j$.
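
The companion notebook mentioned above contains an implementation of this algorithm; what follows is an independent minimal sketch of the same formulas, not the notebook's code. Each sample is represented as a dictionary mapping a category label to the count $c_{jk}$ of annotators who applied that label to it; this data representation is my own choice.

```python
from collections import defaultdict
from math import comb

def agreement_rates(samples):
    """Per-category agreement rates R_j = A_j / P_j.

    `samples` is an iterable of dicts, one per sample, mapping a category
    label j to the number of annotators c_jk who applied that label; the
    total number of annotations c_k for a sample is the sum of its counts.
    """
    A = defaultdict(int)  # agreements per category
    P = defaultdict(int)  # potential agreements per category

    for counts in samples:
        c_k = sum(counts.values())
        for category, c_jk in counts.items():
            A[category] += comb(c_jk, 2)
            P[category] += c_jk * c_k - comb(c_jk + 1, 2)

    return {j: A[j] / P[j] for j in P if P[j] > 0}
```

Categories that a sample did not receive can simply be omitted from its dictionary, since a count of $c_{jk} = 0$ contributes nothing to either sum.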

Let us compute the agreement rates for the categories “contains a sandwich” and “does not contain a sandwich” in the example at the beginning of this post.

Among the 1000 samples, 400 of them were labelled “does not contain a sandwich” twice, 150 of them received this label once, and the remaining 450 received the label zero times. Thus the total number of agreements on the “does not contain a sandwich” category is given by:

\[A_0 = 400 \binom{2}{2} + 150 \binom{1}{2} + 450 \binom{0}{2} = 400\]

To count the potential agreements, note that $c_k = 2$ for all $k$ since each sample is categorized by both annotators, so we have:

\[\begin{align*} P_0 &= 400 \left(2 \cdot 2 - \binom{3}{2}\right) + 150 \left(1 \cdot 2 - \binom{2}{2}\right) + 450 \left(0 \cdot 2 - \binom{1}{2}\right) \\ &= 400 + 150 \\ &= 550 \end{align*}\]

So the rate of agreement on the “does not contain a sandwich” category is $R_0 = \frac{400}{550} \approx 0.73$.

Similar calculations yield the rate of agreement for the other category:

\[A_1 = 400 \binom{0}{2} + 150 \binom{1}{2} + 450 \binom{2}{2} = 450\]

and:

\[\begin{align*} P_1 &= 400 \left(0 \cdot 2 - \binom{1}{2}\right) + 150 \left(1 \cdot 2 - \binom{2}{2}\right) + 450 \left(2 \cdot 2 - \binom{3}{2}\right) \\ &= 150 + 450 \\ &= 600 \end{align*}\]

Thus the rate of agreement on the “contains a sandwich” category is $R_1 = \frac{450}{600} = 0.75$.
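
As a sanity check, feeding the example data into the `agreement_rates` sketch above reproduces both numbers.

```python
# Rebuild the per-sample counts from the contingency table: 400 samples with
# two "no sandwich" labels, 90 + 60 = 150 samples with one label of each kind,
# and 450 samples with two "sandwich" labels.
samples = (
    400 * [{"no sandwich": 2}]
    + 150 * [{"no sandwich": 1, "sandwich": 1}]
    + 450 * [{"sandwich": 2}]
)

print(agreement_rates(samples))
# {'no sandwich': 0.7272727272727273, 'sandwich': 0.75}
```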