Pre-trained vision-language models such as CLIP, when paired with manually designed prompts, have demonstrated strong transfer learning capability. Learnable prompts have recently achieved state-of-the-art performance, but they tend to overfit to seen classes and fail to generalize to unseen ones. In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models. Our approach is inspired by human intelligence, in which external knowledge is usually incorporated when recognizing novel object categories. Specifically, we design two complementary types of knowledge-aware prompts for the text encoder to leverage the distinctive characteristics of category-related external knowledge. The discrete prompt extracts key information from descriptions of an object category, and the learned continuous prompt captures the overall context. We further design an adaptation head for the visual encoder that aggregates salient attentive visual cues, yielding discriminative and task-aware visual representations. We conduct extensive experiments on 11 widely used benchmark datasets, and the results verify the effectiveness of KAPT in few-shot image classification, especially in generalizing to unseen categories. Compared with the state-of-the-art CoCoOp method, KAPT performs favorably, achieving an absolute gain of 3.22% on new classes and 2.57% in terms of harmonic mean.
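The harmonic mean cited in the abstract is not defined there. A minimal sketch follows, assuming it is the harmonic mean of accuracy on seen (base) classes and accuracy on unseen (new) classes, as is common in base-to-new generalization evaluations. The function names and the example accuracies are hypothetical and are not taken from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies (assumed: base-class and new-class).

    Both inputs use the same unit (e.g. percent); the result uses that unit too.
    """
    if base_acc < 0 or new_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = base_acc + new_acc
    if total == 0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * base_acc * new_acc / total


def absolute_gain(method_score: float, baseline_score: float) -> float:
    """Absolute (not relative) difference between two scores in the same unit."""
    return method_score - baseline_score


# Hypothetical accuracies, for illustration only (not results from the paper).
example_h = harmonic_mean(80.0, 60.0)
print(round(example_h, 2))  # prints 68.57
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method that scores well on seen classes but poorly on unseen ones is penalized more than it would be under an arithmetic mean, which makes it a natural summary for the overfitting issue the abstract describes.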