Recent large vision-language models such as CLIP have shown remarkable out-of-distribution (OOD) detection and generalization performance. However, their zero-shot in-distribution (ID) accuracy is often limited on downstream datasets. Recent CLIP-based fine-tuning methods such as prompt learning have demonstrated significant improvements in ID classification and OOD generalization where OOD labels are available. Nonetheless, it remains unclear whether the model is reliable under semantic shifts when no OOD labels are available. In this paper, we aim to bridge this gap and present a comprehensive study of how fine-tuning impacts OOD detection for few-shot downstream tasks. By framing OOD detection as multi-modal concept matching, we establish a connection between fine-tuning methods and various OOD scores. Our results suggest that a proper choice of OOD score is essential for CLIP-based fine-tuning. In particular, the maximum concept matching (MCM) score consistently provides a promising solution. We also show that prompt learning achieves state-of-the-art OOD detection performance, surpassing the zero-shot counterpart.
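For concreteness, the MCM score can be sketched as the maximum of a temperature-scaled softmax over the cosine similarities between an image embedding and the text embeddings of the ID class prompts. The following is a minimal sketch assuming pre-computed CLIP-style image and text features; the function name `mcm_score`, the temperature value, and the random toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mcm_score(image_feat: torch.Tensor, concept_feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Maximum concept matching (MCM) score.

    image_feat:    (D,) image embedding from a CLIP-like image encoder.
    concept_feats: (K, D) text embeddings of the K ID class prompts
                   (e.g. "a photo of a <class>").
    Returns a scalar in (0, 1]; higher suggests the input is in-distribution.
    """
    # Cosine similarity between the image and each class concept.
    img = F.normalize(image_feat, dim=-1)
    txt = F.normalize(concept_feats, dim=-1)
    sims = txt @ img                        # (K,) cosine similarities
    # Temperature-scaled softmax over concepts; MCM is its maximum.
    probs = F.softmax(sims / tau, dim=-1)
    return probs.max()

# Toy usage with random tensors standing in for CLIP features (illustrative only).
torch.manual_seed(0)
image_feat = torch.randn(512)
concept_feats = torch.randn(10, 512)        # 10 hypothetical ID classes
score = mcm_score(image_feat, concept_feats, tau=0.01)
print(f"MCM score: {score:.4f}")            # threshold this value to flag OOD inputs
```

In practice, an input would be declared OOD when its MCM score falls below a threshold chosen on held-out ID data; the specific threshold and temperature are deployment choices, not fixed by the score itself.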