This paper investigates the inter-rater reliability of risk assessment instruments (RAIs). The main question is whether different socially salient groups are affected differently by a lack of inter-rater reliability of RAIs, that is, whether rating mistakes affect different groups differently. The question is investigated with a simulation study on the COMPAS dataset. A controlled degree of noise is injected into the input data of a predictive model; the noise can be interpreted as a synthetic rater that makes mistakes. The main finding is that there are systematic differences in output reliability between groups in the COMPAS dataset. The sign of the difference depends on which inter-rater statistic is used (Cohen's Kappa, Byrt's PABAK, ICC), and in particular on whether the statistic corrects for the groups' prediction prevalences.
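The contrast between a prevalence-sensitive statistic (Cohen's Kappa) and a prevalence-adjusted one (PABAK) can be illustrated with a small sketch. This is not the paper's pipeline: the threshold classifier `toy_model`, the synthetic integer scores, the two groups, and the helper `group_reliability` are all hypothetical stand-ins chosen for illustration, and ICC is omitted. The sketch only shows how agreement between clean and noise-perturbed predictions might be computed per group.

```python
import random

def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) rating sequences of equal length."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n             # prevalence of label 1 per rater
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)       # agreement expected by chance
    if p_e == 1.0:
        # Both raters are constant and identical; returning 1.0 is a
        # convention chosen here, since kappa is formally undefined.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

def pabak(a, b):
    """Prevalence-adjusted bias-adjusted kappa for binary ratings: 2 * p_o - 1."""
    p_o = sum(x == y for x, y in zip(a, b)) / len(a)
    return 2 * p_o - 1

def toy_model(score, threshold=5):
    """Hypothetical predictive model: flags 'high risk' at or above a threshold."""
    return 1 if score >= threshold else 0

def group_reliability(scores, noise, seed=0, threshold=5):
    """Compare predictions on clean inputs with predictions on noisy inputs.

    With probability `noise`, an input score is shifted by +1 or -1, which
    plays the role of a synthetic rater who makes mistakes.
    """
    rng = random.Random(seed)
    clean, noisy = [], []
    for s in scores:
        perturbed = s + rng.choice([-1, 1]) if rng.random() < noise else s
        clean.append(toy_model(s, threshold))
        noisy.append(toy_model(perturbed, threshold))
    return {"kappa": cohen_kappa(clean, noisy), "pabak": pabak(clean, noisy)}

# Synthetic groups (assumption): group B has a lower prevalence of
# positive predictions than group A.
data_rng = random.Random(42)
group_a = [data_rng.randint(0, 10) for _ in range(2000)]
group_b = [data_rng.randint(0, 6) for _ in range(2000)]

for name, scores in [("A", group_a), ("B", group_b)]:
    print(name, group_reliability(scores, noise=0.3))
```

Comparing the two printed statistics across groups shows whether the ordering of groups changes once the prevalence correction is applied, which is the kind of sign dependence the abstract describes. Any numbers produced here come from synthetic data and say nothing about COMPAS itself.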