Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that, in the typical case, the estimator is biased toward over-prediction, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, so that the forecast under-predicts at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning with this objective substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
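To make the idea of extrapolating from the largest k scores concrete, the following is a minimal sketch of a generic log-linear tail fit. It is not the exact estimator of Jones et al. (2025): it assumes, purely for illustration, that the log of the empirical tail probability is approximately linear in the score among the top k evaluation scores. The function names `fit_tail`, `tail_probability`, and `forecast_threshold` are hypothetical.

```python
import numpy as np


def fit_tail(scores, k):
    """Fit log P(score > s) ~ a * s + b using the k largest scores.

    Assumption (illustrative, not from the source): the log tail
    probability is roughly linear in the score over the top-k range.
    """
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending
    n = ordered.size
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of scores")
    top = ordered[:k]
    # Empirical tail probability of the i-th largest score is i / n.
    log_tail = np.log(np.arange(1, k + 1) / n)
    a, b = np.polyfit(top, log_tail, 1)  # slope, intercept
    if a >= 0:
        raise ValueError("fitted tail is not decreasing; fit is unusable")
    return a, b


def tail_probability(a, b, threshold):
    """Extrapolated probability that a single query scores above threshold."""
    return float(np.exp(a * threshold + b))


def forecast_threshold(a, b, n_deploy):
    """Score expected to be exceeded about once in n_deploy queries."""
    return float((np.log(1.0 / n_deploy) - b) / a)


# Synthetic check: scores built so that the tail is exactly exponential.
n_eval = 1000
synthetic = -np.log(np.arange(1, n_eval + 1) / n_eval)
a, b = fit_tail(synthetic, k=50)
worst_at_scale = forecast_threshold(a, b, n_deploy=1_000_000)
```

In this synthetic case the fit recovers the tail exactly, so the extrapolated worst-case score at one million queries equals log(10^6). On real evaluation scores the fit is only as good as the linear-tail assumption, and a rare high-failure mode absent from the evaluation set would not be reflected in the top-k scores at all, which is the under-prediction case described above.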