We propose a "learning to reject" framework to address the problem of silent failures in Domain Generalization (DG), where the test distribution differs from the training distribution. Assuming a mild distribution shift, we wish to accept out-of-distribution (OOD) data whenever a model's estimated competence predicts trustworthy responses, instead of rejecting OOD data outright. Trustworthiness is predicted via a proxy incompetence score that is tightly linked to the performance of a classifier. We present a comprehensive experimental evaluation of incompetence scores for classification and highlight the resulting trade-offs between rejection rate and accuracy gain. For comparability with prior work, we focus on standard DG benchmarks and consider the effect of measuring incompetence via different learned representations in a closed versus an open world setting. Our results suggest that increasing incompetence scores are indeed predictive of reduced accuracy, so that the average accuracy improves significantly when only samples below a suitable incompetence threshold are accepted. However, the scores are not yet good enough to allow for a favorable accuracy/rejection trade-off in all tested domains. Surprisingly, our results also indicate that classifiers optimized for DG robustness do not outperform a naive Empirical Risk Minimization (ERM) baseline in the competence region, that is, where test samples elicit low incompetence scores.
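The accuracy/rejection trade-off described in the abstract can be illustrated with a minimal sketch. It assumes that each test sample comes with a scalar incompetence score and a flag indicating whether the classifier was correct, and that a sample is accepted when its score does not exceed a threshold. The function names, the acceptance rule, and the toy data are hypothetical and chosen only for illustration; they are not the scores or the evaluation protocol used in the paper.

```python
from typing import List, Sequence, Tuple


def accept_mask(scores: Sequence[float], threshold: float) -> List[bool]:
    """Accept a sample when its incompetence score is at or below the threshold."""
    return [s <= threshold for s in scores]


def accuracy_and_rejection(
    scores: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (accuracy on accepted samples, rejection rate) for one threshold."""
    if len(scores) == 0 or len(scores) != len(correct):
        raise ValueError("scores and correct must be non-empty and equally long")
    mask = accept_mask(scores, threshold)
    n_accepted = sum(mask)
    rejection_rate = 1.0 - n_accepted / len(scores)
    if n_accepted == 0:
        # Accuracy is undefined when every sample is rejected.
        return float("nan"), rejection_rate
    n_correct = sum(1 for m, c in zip(mask, correct) if m and c)
    return n_correct / n_accepted, rejection_rate


# Made-up toy data: errors become more frequent as the score grows.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
correct = [True, True, True, False, True, False, False, False]

for t in (0.35, 0.65, 1.0):
    acc, rej = accuracy_and_rejection(scores, correct, t)
    print(f"threshold={t:.2f}  accuracy={acc:.3f}  rejection_rate={rej:.3f}")
```

On this toy data, lowering the threshold raises the accuracy on accepted samples while rejecting a larger share of the test set, which is the trade-off the evaluation quantifies.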