We conduct an extensive study of the state of calibration under real-world dataset shift for image classification. Our work provides important insights into the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks spanning several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best-calibrated raw probabilities under dataset shift; (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are the most robust under shift; (iii) recent calibration methods aimed specifically at improving calibration under shift do not necessarily offer significant improvements over simpler post-hoc calibration methods; (iv) improving calibration under shift often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers as well as for those finetuned from foundation models, the latter being consistently better calibrated than models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that: (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shift; (ii) for ensembles, OOD exposure deteriorates the trade-off between in-distribution and shifted calibration; (iii) ensembling remains one of the most effective methods for improving calibration robustness and, combined with finetuning from foundation models, yields the best calibration results overall.
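Finding (i) combines two standard in-training techniques. As a minimal sketch (not the paper's implementation, and with illustrative hyperparameter values), the combined objective is cross-entropy against label-smoothed targets minus a weighted entropy bonus, which together discourage over-confident predictions:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_training_loss(logits, labels, n_classes,
                             smoothing=0.1, ent_weight=0.1):
    """Cross-entropy with label smoothing, minus a weighted entropy bonus.

    Subtracting the entropy of the predicted distribution penalises
    low-entropy (over-confident) outputs. `smoothing` and `ent_weight`
    are illustrative hyperparameters, not values from the study.
    """
    probs = softmax(logits)
    # Smoothed targets: smoothing / n_classes spread uniformly over all
    # classes, with the remaining (1 - smoothing) mass on the true class.
    targets = np.full((len(labels), n_classes), smoothing / n_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - smoothing
    log_probs = np.log(probs + 1e-12)
    ce = -(targets * log_probs).sum(axis=-1).mean()
    entropy = -(probs * log_probs).sum(axis=-1).mean()
    return ce - ent_weight * entropy
```

With `smoothing=0` and `ent_weight=0` this reduces to standard cross-entropy; both terms soften the targets and the predictions, respectively.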
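Among the post-hoc calibration methods the abstract refers to, temperature scaling is the simplest baseline: a single scalar divides the validation logits and is fit to minimise negative log-likelihood. The sketch below uses a coarse grid search purely for illustration (real implementations typically use gradient-based optimisation such as L-BFGS):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, temperature):
    # Negative log-likelihood of the true labels after temperature scaling.
    probs = softmax(logits / temperature)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature minimising held-out NLL over a coarse grid.

    A minimal sketch of temperature scaling on a validation set; the grid
    bounds are illustrative assumptions.
    """
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return grid[int(np.argmin(losses))]
```

For an over-confident model (sharp logits but imperfect accuracy), the fitted temperature exceeds 1, flattening the predicted probabilities toward the model's true accuracy.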