Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable applicability across a variety of downstream tasks, including zero-shot image classification. Recently, prompts and adapters have gained significant attention as efficient transfer-learning techniques for adapting VLMs to downstream tasks. However, the roles of vision prompts, text prompts, and adapters with respect to generalization and transfer difficulty have been overlooked, which limits performance on unseen tasks. In this paper, we empirically analyze how VLMs behave when using vision prompts, text prompts, adapters, and combinations of these components, an analysis that, to our knowledge, has not been carried out before. We observe that using vision prompts for class separability and text adapters for task adaptation is crucial for adaptability and generalizability. Moreover, to improve generalization across every domain, we propose an adaptive ensemble method that combines the general knowledge of VLMs with task-specific knowledge according to transfer difficulty. In experiments on extensive benchmarks, our method consistently outperforms all baselines, particularly on unseen tasks, demonstrating the effectiveness of the proposed approach.
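The adaptive ensemble idea can be illustrated with a minimal sketch. The abstract does not specify how transfer difficulty is measured or how the two sources of knowledge are combined, so everything below is an assumption made for illustration: the function names `softmax` and `adaptive_ensemble` are hypothetical, transfer difficulty is taken to be a scalar in [0, 1], and the combination is a simple convex mix of class probabilities from the frozen (general) model and the adapted (task-specific) model. It should not be read as the method proposed in the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_ensemble(
    general_logits: List[float],
    task_logits: List[float],
    difficulty: float,
) -> List[float]:
    """Mix general and task-specific predictions for one image.

    ASSUMPTION: `difficulty` is a scalar in [0, 1], where a higher value
    means the pretrained model transfers less well to the task, so the
    task-specific prediction receives more weight. The source abstract
    does not define this quantity or the mixing rule.
    """
    if len(general_logits) != len(task_logits):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")

    p_general = softmax(general_logits)  # frozen VLM, zero-shot prediction
    p_task = softmax(task_logits)        # prompt/adapter-tuned prediction

    # Convex combination: the result is still a probability distribution.
    return [
        (1.0 - difficulty) * g + difficulty * t
        for g, t in zip(p_general, p_task)
    ]


if __name__ == "__main__":
    zero_shot = [2.0, 0.0, 0.0]  # hypothetical scores from the frozen model
    adapted = [0.0, 3.0, 0.0]    # hypothetical scores from the tuned model
    for d in (0.0, 0.5, 1.0):
        print(d, adaptive_ensemble(zero_shot, adapted, d))
```

Under these assumptions, a difficulty of 0 reproduces the zero-shot prediction and a difficulty of 1 reproduces the adapted prediction; intermediate values interpolate between the two.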