Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa

Cohen's and Fleiss' kappa are well-known measures of inter-rater reliability. However, they only allow a rater to select exactly one category for each subject. This is a severe limitation in some research contexts: for example, these measures cannot quantify the inter-rater reliability of a group of psychiatrists who may each diagnose a patient with several disorders. This paper proposes a generalisation of the Fleiss' kappa coefficient that removes this limitation. Specifically, the proposed $\kappa$ statistic measures inter-rater reliability between multiple raters classifying subjects into one-or-more nominal categories. These categories can be weighted according to their importance, and the measure can take the category hierarchy into account (e.g., subcategories that are only available once the main category has been chosen, such as a primary psychiatric disorder and its sub-disorders; much more complex dependencies between categories are possible as well). The proposed $\kappa$ statistic can handle missing data and a varying number of raters for subjects or categories. The paper briefly reviews existing methods that allow raters to classify subjects into multiple categories. Next, we derive the proposed measure step-by-step and prove that it equals Fleiss' kappa when a fixed number of raters choose one category for each subject. The measure was developed to investigate the reliability of a new mathematics assessment method, for which a worked example is given. The paper concludes with a worked-out example of psychiatrists diagnosing patients with multiple disorders.
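For reference, the special case that the proposed measure reduces to can be sketched in a few lines. The snippet below is a minimal illustration of the classical Fleiss' kappa only (a fixed number of raters, each choosing exactly one category per subject); it is not the generalised statistic proposed in the paper. The function name `fleiss_kappa` and the count-matrix input layout are assumptions made for this illustration.

```python
def fleiss_kappa(counts):
    """Classical Fleiss' kappa (illustrative sketch, not the paper's generalisation).

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every subject must be rated by the same number of raters (at least two),
    and each rater picks exactly one category per subject.
    """
    n_subjects = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    # Proportion of all assignments that went to each category.
    p = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]

    # Observed agreement per subject: share of rater pairs that agree.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_subject) / n_subjects

    # Agreement expected by chance.
    expected = sum(x * x for x in p)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (observed - expected) / (1.0 - expected)
```

With this layout, a call such as `fleiss_kappa([[3, 0], [0, 3]])` describes two subjects on which all three raters agree. The requirement that every row sums to the same number of raters is exactly the restriction that the proposed statistic relaxes.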
