Recent approaches for vision-language models (VLMs) have achieved remarkable success in fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) and out-of-distribution (OOD) data. OOD data often involve both covariate shifts (e.g., known classes with changed image styles) and semantic shifts (e.g., classes unseen until test time). It is therefore important to improve VLMs' generalization to covariate-shifted OOD data while effectively detecting open-set, semantic-shifted OOD classes. In this paper, we observe that closed-set data undergo a substantial energy change when the vision-language modalities are re-aligned, specifically when the maximum cosine similarity is directly reduced to a low value. Inspired by this observation, we introduce a novel OOD score named ΔEnergy. ΔEnergy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach to OOD detection. Furthermore, ΔEnergy can simultaneously improve OOD generalization under covariate shifts, which is achieved by maximizing a lower bound of ΔEnergy (termed EBM). EBM is theoretically proven not only to enhance OOD detection but also to yield a domain-consistent Hessian, which serves as a strong indicator of OOD generalization. Based on this finding, we develop a unified fine-tuning framework that improves VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, which outperforms recent approaches by 10% to 25% in AUROC.
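The idea behind the ΔEnergy score can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the energy is taken to be the standard temperature-scaled negative log-sum-exp over image-text cosine similarities, the re-alignment is modeled by overwriting the largest similarity with a fixed low value, and the names `energy`, `delta_energy`, `low_value`, and `temperature` are hypothetical. It should not be read as the authors' implementation.

```python
import numpy as np

def energy(sims, temperature=0.01):
    """Energy score -T * logsumexp(sims / T), computed per row.

    `sims` has shape (num_images, num_classes) and holds cosine similarities.
    The temperature value is an assumption, not taken from the paper.
    """
    z = np.asarray(sims, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

def delta_energy(sims, low_value=0.0, temperature=0.01):
    """Energy change after lowering each row's maximum similarity (assumed form)."""
    sims = np.asarray(sims, dtype=float)
    realigned = sims.copy()
    top = sims.argmax(axis=-1)
    # Assumed re-alignment: overwrite the top similarity with a low value.
    realigned[np.arange(sims.shape[0]), top] = low_value
    return energy(realigned, temperature) - energy(sims, temperature)

# Toy similarities: row 0 has one clearly matching class (ID-like),
# row 1 is nearly flat across classes (OOD-like).
sims = np.array([[0.30, 0.15, 0.14],
                 [0.18, 0.17, 0.16]])
scores = delta_energy(sims)
print(scores)  # a larger value means a larger energy change
```

In this toy setting the peaked row shows a much larger energy change than the flat row, which matches the abstract's observation that closed-set data change substantially under re-alignment. Thresholding such a score to flag OOD inputs would be one way to use it, again under the assumptions stated above.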