Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment

Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post-release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity: we first derive a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then evaluate the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk predicted by any of these models is effective for addressing predictive arbitrariness.
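The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give formal definitions, so the following are assumptions: predictive agreement is taken as the fraction of individuals on which two models assign the same label, averaged over all model pairs; "ambiguity" is a hypothetical name for the share of individuals who receive conflicting labels; the 0.5 threshold and the toy scores are invented for illustration.

```python
from itertools import combinations


def binarize(scores, threshold=0.5):
    """Turn per-model risk scores into 0/1 labels (threshold is an assumption)."""
    return [[1 if s >= threshold else 0 for s in model] for model in scores]


def mean_pairwise_agreement(preds):
    """Average over all model pairs of the fraction of individuals
    on which the two models assign the same label (assumed definition)."""
    n = len(preds[0])
    pairs = list(combinations(range(len(preds)), 2))
    total = 0.0
    for a, b in pairs:
        same = sum(1 for i in range(n) if preds[a][i] == preds[b][i])
        total += same / n
    return total / len(pairs)


def ambiguity(preds):
    """Fraction of individuals who receive conflicting labels from the model set."""
    n = len(preds[0])
    return sum(1 for i in range(n) if len({p[i] for p in preds}) > 1) / n


def lowest_risk_policy(scores):
    """Assign each individual the lowest score given by any model in the set."""
    return [min(column) for column in zip(*scores)]


# Toy data: rows are three hypothetical models, columns are four individuals.
scores = [
    [0.20, 0.70, 0.55, 0.90],
    [0.30, 0.60, 0.45, 0.80],
    [0.10, 0.80, 0.60, 0.40],
]
preds = binarize(scores)

print("mean pairwise agreement:", mean_pairwise_agreement(preds))
print("ambiguity:", ambiguity(preds))
print("lowest-risk scores:", lowest_risk_policy(scores))
```

In this toy set, two of the four individuals receive conflicting labels. Taking the minimum score resolves each conflict in the individual's favor, which is one reading of the lowest-risk policy described above.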

