Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use cases. Several solutions have recently been proposed to tackle this problem, most of which leverage model predictions or softmax confidence to derive accuracy estimates. However, under dataset shift, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account the distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Specifically, we introduce a "distance-check" that flags samples lying too far from the expected distribution, so that their untrustworthy model outputs are not relied upon in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts and hundreds of models, with a median relative mean absolute error (MAE) improvement of 27% over the best baseline across all tasks, and state-of-the-art performance on 10 out of 13 tasks. Our code is publicly available at https://github.com/melanibe/distance_matters_performance_estimation.
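To make the "distance-check" idea concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes that distances are measured as the mean Euclidean distance to the k nearest training samples in some embedding space, that the distance threshold is a high quantile of the same statistic computed on the training set, that the underlying accuracy estimator simply thresholds the confidence score, and that flagged samples are counted as incorrect. None of these choices are stated in the abstract. The function names `knn_mean_distance` and `estimate_accuracy` and all parameters are hypothetical.

```python
import numpy as np


def knn_mean_distance(queries, references, k, exclude_self=False):
    """Mean Euclidean distance from each query to its k nearest references."""
    # Pairwise distances, shape (n_queries, n_references).
    diffs = queries[:, None, :] - references[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)
    # When queries and references are the same set, column 0 holds the
    # zero distance of each point to itself, so skip it.
    start = 1 if exclude_self else 0
    return dists[:, start:start + k].mean(axis=1)


def estimate_accuracy(train_emb, test_emb, test_conf, conf_threshold,
                      k=5, quantile=0.99):
    """Confidence-based accuracy estimate with a distance check (sketch)."""
    # Assumption: calibrate the distance threshold on the training set itself.
    train_d = knn_mean_distance(train_emb, train_emb, k, exclude_self=True)
    dist_threshold = np.quantile(train_d, quantile)

    # Flag test samples that lie farther from the training set than that.
    test_d = knn_mean_distance(test_emb, train_emb, k)
    in_range = test_d <= dist_threshold

    # Assumption: a sample is predicted correct only if it is both confident
    # and close enough; flagged samples are counted as incorrect.
    confident = test_conf >= conf_threshold
    predicted_correct = confident & in_range
    return predicted_correct.mean(), in_range


# Synthetic demo: half of the test set is shifted far from the training data
# while the (hypothetical) model stays highly confident on every sample.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(200, 8))
near = rng.normal(size=(50, 8))
far = rng.normal(loc=10.0, size=(50, 8))
test_emb = np.concatenate([near, far])
test_conf = np.full(100, 0.95)

acc, in_range = estimate_accuracy(train_emb, test_emb, test_conf,
                                  conf_threshold=0.9)
print(f"estimated accuracy: {acc:.2f}, flagged: {(~in_range).sum()}")
```

In this toy setup a purely confidence-based estimate would report every sample as correct, whereas the distance check stops the far-away half from contributing to the estimate. The quadratic pairwise distance computation is only suitable for small arrays; a nearest-neighbour index would be the natural replacement at scale.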