With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in the pre-trained model for information that benefits the downstream task. A recently proposed method, Context Optimization (CoOp), introduces a set of learnable vectors as text prompts on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier"; the visual features computed by the image encoder cannot be affected, which leads to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, we further propose a Class-Aware Visual Prompt Tuning (CAVPT) scheme within DPT, in which the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, so that it encodes both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
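To illustrate the cross-attention step described for CAVPT, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections, where text prompt features act as queries and image patch token embeddings act as keys and values. The function names `softmax` and `cross_attention` and the toy inputs are hypothetical; the released code may use learned projections, multiple heads, normalization, or a different arrangement.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, patch_tokens):
    """Single-head scaled dot-product cross attention (illustrative assumption).

    text_feats:   list of C query vectors, one per class, each of length d
    patch_tokens: list of N key/value vectors, one per image patch, length d
    Returns C vectors of length d, one class-aware visual prompt per class.
    """
    d = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(d)
    prompts = []
    for query in text_feats:
        # Similarity between this class's text feature and every image patch.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in patch_tokens]
        weights = softmax(scores)
        # Weighted sum of patch tokens: patches matching the class dominate.
        prompt = [sum(w * value[j] for w, value in zip(weights, patch_tokens))
                  for j in range(d)]
        prompts.append(prompt)
    return prompts


# Toy inputs (hypothetical): 2 classes, 3 patches, feature dimension 2.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
patch_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
visual_prompts = cross_attention(text_feats, patch_tokens)
```

In this toy setting each resulting prompt is a convex combination of the patch tokens, weighted toward the patches most similar to the corresponding class feature, which is the sense in which the prompt carries both class-related and instance-specific information.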