This post shows a few ways to measure data quality via inter-annotator agreement and, in doing so, highlights the challenges of constructing a high-quality dataset.

Data labeling is the first step of the classical machine learning pipeline, followed by modeling and evaluation. The success of the data labeling step lies in two dimensions: the quality and the quantity of the resulting dataset. Quality is arguably more important than quantity: a small dataset with high quality is still usable, whereas a large dataset with low quality is generally unreliable. However, while quantity is easily measured by the number of datapoints, measuring quality is not so trivial.

When we talk about the quality of a dataset, we usually mean the correctness of the annotated labels. (There are other aspects such as representativeness, diversity, etc., which are not covered here.) However, if a data labeling project has only its own team of annotators, there is no “authority team” to judge the annotations. So, how can we measure correctness? A necessary condition for correctness is the reliability of the annotation procedure: when a task is done right, the result should be right. To measure reliability, a popular method is inter-annotator agreement (IAA). The key idea is that when the labels have low “variance” across independent annotators, the end result could be approximately reproduced by a new, independent team. Such a procedure is deemed reliable, which suggests high data quality.

The Kappa/Alpha Family

We now come to the most popular class of metrics for IAA: the Kappa/Alpha Family. These metrics return a single number, which is generally calculated as follows:

\[A=\frac{A_o-A_e}{1-A_e}\]

where $A_o$ is the observed agreement and $A_e$ is the expected agreement by chance.

We also define the parallel disagreement form. First, let $D_o=1-A_o$ and $D_e=1-A_e$ — these are disagreement counterparts to $A_o$ and $A_e$. As such, $A$ can be rewritten as:

\[A=\frac{(1-D_o)-(1-D_e)}{1-(1-D_e)} = \frac{D_e-D_o}{D_e} = 1-\frac{D_o}{D_e}\]

Different metrics define $A_o$ and $A_e$ differently.
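To make this template concrete, here is a minimal Python sketch (the function names are my own, for illustration) that computes $A$ from either the agreement or the disagreement form; the two are algebraically equivalent.

```python
def chance_corrected_agreement(a_o: float, a_e: float) -> float:
    """Generic Kappa/Alpha template: A = (A_o - A_e) / (1 - A_e)."""
    return (a_o - a_e) / (1.0 - a_e)


def chance_corrected_from_disagreement(d_o: float, d_e: float) -> float:
    """Equivalent disagreement form: A = 1 - D_o / D_e."""
    return 1.0 - d_o / d_e


# With A_o = 0.8 and A_e = 0.5 (so D_o = 0.2, D_e = 0.5),
# both forms give (0.8 - 0.5) / (1 - 0.5) = 1 - 0.2 / 0.5 = 0.6.
print(chance_corrected_agreement(0.8, 0.5))          # 0.6
print(chance_corrected_from_disagreement(0.2, 0.5))  # 0.6
```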

First, consider Fleiss’s $\kappa$. In this metric, $A_o$ is defined as follows:

\[A_o=\frac{1}{N}\sum_i\frac{\text{agreed pairs of annotations for item $i$}}{\text{pairs of annotations for item $i$}}=\frac{1}{N}\sum_i\frac{\sum_{k}{n_{ik}\choose 2}}{{C\choose 2}}=\frac{\sum_{i}\sum_{k}{n_{ik}(n_{ik}-1)}}{NC(C-1)}\]

where:

  • $N$ is the total number of items to annotate,
  • $C$ is the total number of annotators (each item is assumed to be annotated by all $C$ of them), and
  • $n_{ik}$ is the total number of times label $k$ is chosen for item $i$ (across annotators).

Then, $A_e$ is defined as:

\[A_e=\sum_k\left(\frac{n_k}{NC}\right)^2=\frac{\sum_k{n_k^2}}{(NC)^2}\]

where $n_k$ is the number of times label $k$ is chosen across all items and annotators. This assumes that all items share the same set of possible labels. Note that the expected agreement is not entirely a priori: it is estimated from the observed $n_k$’s.
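Putting $A_o$ and $A_e$ together, below is a small Python sketch of Fleiss’s $\kappa$ (the function name and the NumPy dependency are my choices; it assumes every item is annotated by the same $C$ annotators, as the formulas above do).

```python
import numpy as np


def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss's kappa from an N x K matrix, where counts[i, k] = n_ik is the
    number of annotators who chose label k for item i. Assumes every item
    is annotated by the same number of annotators C."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    c = counts.sum(axis=1)[0]  # annotators per item (assumed constant)

    # Observed agreement: fraction of agreeing annotator pairs, averaged over items.
    a_o = (counts * (counts - 1)).sum() / (n_items * c * (c - 1))

    # Expected agreement by chance, from the pooled label proportions n_k / (N * C).
    p_k = counts.sum(axis=0) / counts.sum()
    a_e = (p_k ** 2).sum()

    return (a_o - a_e) / (1.0 - a_e)


# Toy example: 4 items, 3 annotators, 2 labels.
counts = np.array([
    [3, 0],  # all three annotators chose label 0
    [0, 3],
    [2, 1],
    [1, 2],
])
print(fleiss_kappa(counts))  # (2/3 - 1/2) / (1 - 1/2) = 1/3
```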

Next, another popular metric is Krippendorff’s $\alpha$. This metric works with disagreements instead of agreements. The idea is to introduce a more precise measure of disagreement: in $D_o$, each disagreement is weighted by a distance between the two labels, instead of every disagreement counting as 1. Then, $D_e$ is defined as the expected distance between two annotations of the same item, given the set of $n_k$’s.
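As an illustration of this disagreement-based view, here is a sketch of Krippendorff’s $\alpha$ restricted to a nominal (0/1) distance, built from the usual coincidence-matrix recipe; the function name and the input layout (a list of label lists per item, with missing annotations simply omitted) are my own assumptions.

```python
from collections import Counter


def krippendorff_alpha_nominal(annotations):
    """Krippendorff's alpha with a nominal distance (0 if labels match, 1 otherwise).
    `annotations` is a list of items; each item is the list of labels it received."""
    # Coincidence counts: for each item with m >= 2 labels, every ordered pair of
    # labels from different annotators contributes a weight of 1 / (m - 1).
    coincidences = Counter()  # (label_a, label_b) -> weight
    for labels in annotations:
        m = len(labels)
        if m < 2:
            continue  # items with a single annotation carry no pairable information
        for i, a in enumerate(labels):
            for j, b in enumerate(labels):
                if i != j:
                    coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()  # marginal weight per label
    for (a, _), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())

    # Observed and expected disagreement under the nominal distance.
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    d_e = sum(totals[a] * totals[b]
              for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 - d_o / d_e


# Toy example: 4 items, up to 3 annotations each, labels "x" and "y".
data = [["x", "x", "x"], ["x", "y"], ["y", "y", "y"], ["x", "x", "y"]]
print(krippendorff_alpha_nominal(data))
```

For graded labels (e.g., ordinal scales), the nominal distance would be replaced by an ordinal or interval distance, which is exactly the extra flexibility $\alpha$ offers over Fleiss’s $\kappa$.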

Finally, there are two other seemingly popular metrics. The third is Cohen’s $\kappa$, which is not the same as Fleiss’s $\kappa$; Artstein [1] recommends being specific about which Kappa one is talking about. The fourth is Scott’s $\pi$ (which I haven’t read much about).

A technical challenge of data labeling

So now there are objective metrics to measure procedure reliability, and the numbers will transparently show whether the procedure is reliable or not. If a team has plenty of money for data labeling but its annotation procedure is unreliable, there is no way to obtain a high IAA score. Instead, the team needs to invest resources in thoroughly understanding the task, writing a digestible and unambiguous set of guidelines for annotators, and training the annotators carefully.

Note that the Kappa/Alpha family is only applicable when there is a reliable way to tell apart different levels of disagreement (two levels in Fleiss’s $\kappa$, and possibly more in Krippendorff’s $\alpha$). Therefore, the Kappa/Alpha family is most naturally applied to tasks where labels are categorical, such as binary and multi-way classification. On the other hand, free-text annotation tasks such as translation and summarization lack an automated way to measure IAA.

Reference

[1] Artstein, R. (2017). Inter-annotator Agreement. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_11