Measurement of interrater agreement (IRA) is critical in many disciplines. To correct for chance agreement, which can confound IRA, Cohen's kappa and many other methods have been proposed. However, because these methods rest on varied strategies and assumptions, practical guidelines on which method to prefer are lacking, even for the common case of two raters giving dichotomous ratings. To fill this gap, we systematically review nine IRA methods and propose a generalized framework that simulates the correlated decision processes of the two raters, allowing the reviewed methods to be compared under a comprehensive range of practical scenarios. Within this framework, an estimand of the "true" chance-corrected IRA is defined by accounting for "probabilistic certainty" and serves as the comparison benchmark. We carry out extensive simulations to evaluate the performance of the reviewed IRA measures, and an agglomerative hierarchical clustering analysis is conducted to assess the interrelationships among the included methods and the benchmark metric. We provide recommendations for selecting appropriate IRA statistics under different practical conditions and highlight the need for further advances in IRA estimation methodology.
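For orientation, Cohen's kappa in the two-rater dichotomous setting corrects the observed agreement p_o by the chance agreement p_e implied by the raters' marginal distributions, via kappa = (p_o - p_e) / (1 - p_e). A minimal sketch follows; the function name and the example counts are illustrative, not from the paper.

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a two-rater dichotomous rating.

    `table` is a 2x2 contingency table: table[i][j] counts subjects
    assigned category i by rater 1 and category j by rater 2.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Observed agreement: proportion of subjects on the diagonal.
    p_o = np.trace(table) / n
    # Chance agreement: expected diagonal proportion if the raters
    # decided independently with their observed marginal rates.
    p_e = (table.sum(axis=1) @ table.sum(axis=0)) / n**2
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical example: 50 yes/yes, 30 no/no, 10 disagreements each way.
print(cohens_kappa([[50, 10], [10, 30]]))  # ~0.583
```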