Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to form image-text pairs for specific downstream tasks. While hand-crafted or template-based prompts generalize to a wider range of unseen classes, they tend to perform poorly on downstream tasks (i.e., seen classes). Learnable soft prompts, on the other hand, often perform well on downstream tasks but lack generalizability. Additionally, prior research has predominantly concentrated on the textual modality, with very few studies exploring the prompt's generalization potential from the visual modality. Keeping these limitations in mind, we investigate how to perform prompt tuning so as to obtain both competitive downstream performance and strong generalization. The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality, and maximizing their mutual information, we can better combine task-specific and general semantic information. Moreover, to generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, yielding significant robustness to a wider range of unseen classes. Extensive evaluations on several benchmarks report that the proposed approach achieves competitive results in terms of both task-specific performance and general abilities.
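To make the dual-view objective concrete, below is a minimal NumPy sketch of an InfoNCE-style lower bound on the mutual information between soft-prompt and hand-crafted-prompt embeddings. The function name, shapes, and temperature are illustrative assumptions for exposition, not the paper's exact loss.

```python
import numpy as np

def info_nce_lower_bound(soft, hand, temperature=0.07):
    """InfoNCE-style loss between two views of per-class prompt embeddings.

    soft, hand: (N, D) arrays, one embedding per class; row i of `soft`
    and row i of `hand` describe the same class (a positive pair).
    Minimizing this loss maximizes a lower bound on their mutual information.
    """
    # L2-normalize so the dot product is cosine similarity
    soft = soft / np.linalg.norm(soft, axis=1, keepdims=True)
    hand = hand / np.linalg.norm(hand, axis=1, keepdims=True)
    logits = soft @ hand.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matched pairs lie on the diagonal; off-diagonal entries act as negatives.
    return -np.mean(np.diag(log_probs))
```

In a training loop, this term would be added to the usual task loss on seen classes, so the learned soft prompts stay close (in mutual-information terms) to the general hand-crafted view while still fitting the downstream task.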