Improving out-of-distribution (OOD) generalization through in-distribution (ID) adaptation is a primary goal of robust fine-tuning methods, beyond what naive fine-tuning achieves. However, despite the strong OOD generalization of recent robust fine-tuning methods, OOD confidence calibration, which is essential for reliable machine learning, has not been fully addressed. This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration in Vision-Language Models (VLMs). First, we show that the OOD classification error and the OOD calibration error share an upper bound consisting of two terms measured on ID data: 1) the ID calibration error and 2) the smallest singular value of the ID input covariance matrix. Based on this insight, we design a novel framework that fine-tunes with a constrained multimodal contrastive loss enforcing a larger smallest singular value, further aided by self-distillation from a moving-averaged model to achieve well-calibrated predictions. Starting from an empirical validation of our theoretical statements, we provide extensive experimental results on ImageNet distribution-shift benchmarks that demonstrate the effectiveness of our method.
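To make the described training objective concrete, the following is a minimal sketch (not the authors' released implementation) of one fine-tuning step that combines a CLIP-style multimodal contrastive loss, a penalty encouraging a larger smallest singular value of the ID feature covariance matrix, and self-distillation from an exponential-moving-average (EMA) teacher. The model interface and the hyperparameters `lambda_sv`, `lambda_distill`, and `ema_decay` are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the training step described in the abstract, assuming a
# student/teacher VLM callable that returns (image_features, text_features).
import torch
import torch.nn.functional as F


def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized image and text features.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def smallest_singular_value(features):
    # Smallest singular value of the centered feature covariance matrix.
    centered = features - features.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / max(features.size(0) - 1, 1)
    # The covariance is symmetric PSD, so eigenvalues equal singular values.
    return torch.linalg.eigvalsh(cov)[0]


def train_step(student, teacher, images, texts, optimizer,
               lambda_sv=0.1, lambda_distill=1.0, ema_decay=0.999, temperature=0.07):
    img_feat, txt_feat = student(images, texts)
    loss = contrastive_loss(img_feat, txt_feat, temperature)

    # Constraint term: encourage a larger smallest singular value of the
    # ID feature covariance by subtracting it from the loss.
    loss = loss - lambda_sv * smallest_singular_value(img_feat)

    # Self-distillation: match the EMA teacher's image-text similarity distribution.
    with torch.no_grad():
        t_img, t_txt = teacher(images, texts)
        t_logits = F.normalize(t_img, dim=-1) @ F.normalize(t_txt, dim=-1).t() / temperature
    s_logits = F.normalize(img_feat, dim=-1) @ F.normalize(txt_feat, dim=-1).t() / temperature
    loss = loss + lambda_distill * F.kl_div(
        F.log_softmax(s_logits, dim=-1), F.softmax(t_logits, dim=-1), reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # EMA update of the teacher (moving-averaged model).
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
    return loss.item()
```

Under these assumptions, the singular-value term directly targets the second term of the shared upper bound, while the EMA self-distillation term targets the ID calibration term.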