To measure the degree of agreement between two observers who independently classify $n$ subjects into $K$ categories, kappa-type coefficients are commonly used, the most common being Cohen's kappa, $κ_C$. Because $κ_C$ has some weaknesses, such as its poor performance with highly unbalanced marginal distributions, the $Δ$ coefficient, based on the delta response model, is sometimes used instead. This model also yields other parameters: (a) the contribution $α_i$ of each category $i$ to the global agreement $Δ=\sum α_i$; and (b) the consistency $\mathcal{S}_i$ in category $i$ (the degree of agreement in category $i$), a more appropriate parameter than the kappa value obtained by collapsing the data into category $i$. It has recently been shown that the classic estimator $\hatκ_C$ underestimates $κ_C$, and a new, less biased estimator $\hatκ_{CU}$ has been obtained. This article demonstrates that something similar happens with the known estimators $\hatΔ$, $\hatα_i$, and $\hat{\mathcal{S}}_i$ of $Δ$, $α_i$ and $\mathcal{S}_i$ (respectively), proposes new, less biased estimators $\hatΔ_U$, $\hatα_{iU}$, and $\hat{\mathcal{S}}_{iU}$, determines their variances, analyses the behaviour of all the estimators, and concludes that the new estimators should be used when $n$ or $K$ is small (at least when $n\leq 50$ or $K\leq 3$). Additionally, the case in which one of the raters is a gold standard is considered; in this situation two new parameters arise: the conformity (the rater's capability to recognize a subject in category $i$) and the predictivity (the reliability of a response $i$ by the rater).
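For orientation, the classic plug-in estimator $\hatκ_C$ can be sketched in a few lines of Python. This is only the standard textbook formula $(p_o - p_e)/(1 - p_e)$, not the new estimators proposed in the article; the function name `cohen_kappa` and the example tables are hypothetical and chosen purely for illustration.

```python
def cohen_kappa(table):
    """Classic plug-in estimate of Cohen's kappa.

    `table` is a K x K list of lists of counts: table[i][j] is the number
    of subjects placed in category i by rater 1 and in category j by rater 2.
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("table must be square (K x K)")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one subject")

    # Observed agreement: proportion of subjects on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Marginal proportions for each rater.
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Agreement expected by chance under independence of the two raters.
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical 2 x 2 tables (illustrative counts, not data from the article).
balanced = [[20, 5], [10, 15]]   # fairly balanced marginals
unbalanced = [[90, 5], [5, 0]]   # highly unbalanced marginals

kappa_balanced = cohen_kappa(balanced)
kappa_unbalanced = cohen_kappa(unbalanced)
```

In the second table the raters agree on 90 of 100 subjects, yet the estimate is negative because the chance-agreement term is inflated by the unbalanced marginals. This is the kind of weakness of $κ_C$ that motivates alternatives such as $Δ$.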