The emergence of vision-language models (VLMs), such as CLIP, has spurred significant research into applying them to downstream supervised learning tasks. Although some previous studies have explored the unsupervised fine-tuning of CLIP, they often rely on prior knowledge in the form of class names associated with ground truth labels. In this paper, we study a realistic unsupervised fine-tuning scenario in which the unlabeled data may contain out-of-distribution samples from unknown classes. Furthermore, we emphasize the importance of enhancing out-of-distribution detection while also improving the recognition of instances associated with predefined class labels. To tackle this problem, we present a simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO). UEO leverages sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Apart from optimizing the textual prompts, UEO also optimizes channel-wise affine transformations within the visual branch of CLIP. Through extensive experiments conducted across 15 domains and 4 different types of prior knowledge, we demonstrate that UEO surpasses baseline methods in terms of both generalization and out-of-distribution detection.
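The entropy objective described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The abstract does not say how confidence is measured or how samples are weighted, so the sketch assumes that confidence is the maximum softmax probability, that the conditional entropy is weighted by normalized confidence, and that the marginal prediction is weighted by normalized inverse confidence. The function names (`softmax`, `entropy`, `ueo_style_loss`) are hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def entropy(probs, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of each probability vector along `axis`."""
    return -(probs * np.log(probs + eps)).sum(axis=axis)


def ueo_style_loss(logits):
    """Confidence-weighted entropy objective for a batch of logits (N, C).

    Assumptions (not stated in the abstract):
      - confidence = max softmax probability per sample
      - confident samples dominate the conditional-entropy term
      - less confident samples dominate the marginal-entropy term
    Returns (loss, conditional_entropy, marginal_entropy).
    """
    probs = softmax(logits)                       # (N, C)
    conf = probs.max(axis=1)                      # (N,)

    # Normalized weights: high confidence -> large w_conf, small w_unconf.
    w_conf = conf / conf.sum()
    inv_conf = 1.0 / conf
    w_unconf = inv_conf / inv_conf.sum()

    # Weighted conditional entropy: to be minimized.
    cond_entropy = (w_conf * entropy(probs)).sum()

    # Entropy of the weighted average prediction: to be maximized.
    marginal = (w_unconf[:, None] * probs).sum(axis=0)   # (C,)
    marg_entropy = entropy(marginal)

    loss = cond_entropy - marg_entropy
    return loss, cond_entropy, marg_entropy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch_logits = rng.normal(size=(8, 5))        # 8 samples, 5 known classes
    loss, h_cond, h_marg = ueo_style_loss(batch_logits)
    print(f"loss={loss:.4f}  H_cond={h_cond:.4f}  H_marg={h_marg:.4f}")
```

Minimizing this loss lowers the per-sample entropy of confident predictions while pushing the inverse-confidence-weighted average prediction toward uniform. In an actual fine-tuning loop the same quantity would be computed in an autodiff framework, so that gradients reach the text prompts and the channel-wise affine parameters mentioned in the abstract.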