Prompt learning has been designed as an alternative to fine-tuning for adapting vision-language (V-L) models to downstream tasks. Previous works mainly focus on text prompts, while visual prompt studies for V-L models remain limited. Existing visual prompt methods suffer from either mediocre performance or an unstable training process, indicating the difficulty of visual prompt learning. In this paper, we propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers. More importantly, ProVP can effectively propagate image embeddings to deep layers and behaves, in part, like an instance-adaptive prompt method. To alleviate generalization deterioration, we further propose a new contrastive feature re-formation, which prevents the prompted visual features from deviating severely from the fixed CLIP visual feature distribution. Combining both, our method (ProVP-Ref) is evaluated on 11 image benchmark datasets and achieves state-of-the-art results on 7 of the 11 in both the few-shot and base-to-novel settings. To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models can outperform previous prompt-based methods on downstream tasks. This also indicates that ProVP-Ref offers the best capability to both adapt and generalize.
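The contrastive feature re-formation described above anchors the prompted visual features to the frozen CLIP feature distribution. The exact loss is defined in the method section; the sketch below only illustrates the general idea with an InfoNCE-style objective (the loss form, function names, and temperature value here are illustrative assumptions, not the paper's definition), where each prompted feature is pulled toward the frozen CLIP feature of the same image and pushed away from those of other images.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def feature_reformation_loss(prompted_feats, clip_feats, temperature=0.07):
    """Illustrative contrastive loss (an assumption, not the paper's exact form):
    pulls each prompted visual feature toward the frozen CLIP feature of the
    same image (positive pair) and away from other images' CLIP features."""
    n = len(prompted_feats)
    total = 0.0
    for i in range(n):
        sims = np.array([cosine_sim(prompted_feats[i], clip_feats[j])
                         for j in range(n)])
        logits = sims / temperature
        # Softmax cross-entropy with the matching CLIP feature as the target.
        log_prob = logits[i] - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
        total += -log_prob
    return total / n
```

When the prompted features stay close to their frozen CLIP counterparts, this loss is small; when they drift toward other images' features, it grows, discouraging severe deviation from the pretrained distribution.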