Inter-rater reliability (IRR) is a commonly used tool for assessing the quality of ratings from multiple raters, since it can be obtained directly from the observed ratings. However, applicant selection procedures based on ratings from multiple raters usually end in a binary outcome: the applicant is either selected or not. IRR does not account for this final outcome; it focuses instead on the ratings of the individual subjects or objects. In this work, we outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a quantile approximation that allows us to estimate the probability of correctly selecting the best applicants and to compute the error probabilities of the selection procedure (i.e., the false-positive and false-negative rates) under the assumption that the ratings are valid. If the ratings are not completely valid, the computed error probabilities are lower bounds on the true error probabilities. We draw connections between IRR and binary classification metrics, showing that these metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the quantile approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures.
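To make the link between IRR and selection errors concrete, the following is a minimal Monte Carlo sketch, not the quantile approximation developed in the paper. It assumes a simple additive model (observed rating = true score + independent normal error, total variance 1), treats the IRR coefficient as the share of variance due to the true score, and assumes the ratings are valid, so that the "best" applicants are those with the highest true scores. The function name `simulate_selection_errors` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_selection_errors(irr, prop_selected, n_applicants=200_000, seed=0):
    """Estimate selection error rates by simulation (illustrative assumptions only).

    irr           -- assumed reliability of the rating used for selection, in (0, 1]
    prop_selected -- proportion of applicants selected, in (0, 1)
    """
    if not 0.0 < irr <= 1.0:
        raise ValueError("irr must lie in (0, 1]")
    k = int(round(prop_selected * n_applicants))
    if not 0 < k < n_applicants:
        raise ValueError("prop_selected must select at least one and not all applicants")

    rng = np.random.default_rng(seed)
    # Assumed measurement model: Var(true) = irr, Var(error) = 1 - irr.
    true_score = rng.normal(0.0, np.sqrt(irr), n_applicants)
    observed = true_score + rng.normal(0.0, np.sqrt(1.0 - irr), n_applicants)

    # The k applicants with the highest true scores are the ones who should be selected.
    truly_best = np.zeros(n_applicants, dtype=bool)
    truly_best[np.argsort(true_score)[-k:]] = True
    # The k applicants with the highest observed ratings are the ones actually selected.
    selected = np.zeros(n_applicants, dtype=bool)
    selected[np.argsort(observed)[-k:]] = True

    false_pos = np.sum(selected & ~truly_best)
    false_neg = np.sum(~selected & truly_best)
    return {
        "false_positive_rate": false_pos / (n_applicants - k),
        "false_negative_rate": false_neg / k,
        "prob_correct_selection": np.sum(selected & truly_best) / k,
    }


if __name__ == "__main__":
    for irr in (0.3, 0.6, 0.9):
        print(irr, simulate_selection_errors(irr, prop_selected=0.2))
```

The simulation takes only the IRR coefficient and the proportion selected as inputs, mirroring the dependence stated above. Because the number selected is fixed, every false positive displaces exactly one truly best applicant, so the false-positive and false-negative counts coincide and only the rates differ.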