With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in the pre-trained model for information that benefits downstream tasks. A recently proposed method, Context Optimization (CoOp), introduces a set of learnable vectors as the text prompt on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier"; the visual features computed by the image encoder cannot be affected, which leads to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, we further propose a Class-Aware Visual Prompt Tuning (CAVPT) scheme within DPT, in which the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, so that it encodes both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
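The cross attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the released DPT implementation: it assumes a single attention head, no learned projection matrices, and toy hand-written vectors, and the names `cross_attention`, `text_prompt_features`, and `patch_tokens` are hypothetical. It only shows the general idea of using per-class text prompt features as queries over image patch token embeddings, so that each output vector mixes patch information according to its similarity to one class.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross attention (assumption: no learned projections).

    queries:     K text prompt feature vectors (one per class), each of length d
    keys_values: N image patch token embeddings, each of length d
    returns:     K vectors of length d, each a weighted average of the patch tokens
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in queries:
        # Similarity of this class query to every patch token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of patch tokens: the class-conditioned summary vector.
        out = [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        outputs.append(out)
    return outputs


# Toy data (hypothetical): 2 classes, 3 image patches, feature dimension 4.
text_prompt_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patch_tokens = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

class_aware_prompts = cross_attention(text_prompt_features, patch_tokens)
```

In this toy setup each output vector leans toward the patch that best matches its class query, which is the intuition behind conditioning a visual prompt on both the task (class text features) and the instance (patch tokens). A practical version would likely add learned query, key, and value projections and multiple heads; those details are not stated in the abstract and are left out here.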