Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post-release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. We then study predictive multiplicity in two steps: we derive a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and we evaluate the extent to which structural diversity within this set (e.g., different model coefficients) translates into predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness.
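To make the two quantities discussed above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small hypothetical matrix of risk scores (rows are similarly accurate models, columns are individuals), a hypothetical decision threshold of 0.5, and a plain average of pairwise agreement rates as the empirical agreement measure; the function names `pairwise_agreement` and `lowest_risk` are illustrative only. It does not reproduce the theoretical lower bound, whose form is not stated here.

```python
import numpy as np


def pairwise_agreement(preds: np.ndarray) -> float:
    """Average, over all pairs of distinct models, of the fraction of
    individuals on which the two models make the same binary prediction.

    preds: array of shape (n_models, n_individuals) with 0/1 entries.
    """
    n_models = preds.shape[0]
    rates = [
        np.mean(preds[i] == preds[j])
        for i in range(n_models)
        for j in range(i + 1, n_models)
    ]
    return float(np.mean(rates))


def lowest_risk(scores: np.ndarray) -> np.ndarray:
    """Assign each individual the minimum risk score across all models."""
    return scores.min(axis=0)


# Hypothetical scores: 3 models (rows) x 4 individuals (columns).
scores = np.array([
    [0.20, 0.70, 0.55, 0.10],
    [0.25, 0.65, 0.45, 0.15],
    [0.30, 0.80, 0.60, 0.05],
])
threshold = 0.5  # assumed cutoff for labeling an individual high risk

preds = (scores >= threshold).astype(int)
agreement = pairwise_agreement(preds)

final_scores = lowest_risk(scores)
final_preds = (final_scores >= threshold).astype(int)

print(f"mean pairwise agreement: {agreement:.3f}")
print("lowest-risk scores:", final_scores)
print("lowest-risk predictions:", final_preds)
```

In this toy example the models disagree only on the third individual, and the lowest-risk policy resolves that disagreement in the individual's favor by taking the smallest of the three scores.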