This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our method includes two key designs. First, rather than directly adding the prompt to the image, we treat the prompt as an extra, independent learnable component. We show that the strategy for reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunk image works best empirically. Second, we re-introduce into visual prompting two "old tricks" commonly used in building transferable adversarial examples, namely input diversity and gradient normalization. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method sets a new record of 82.8% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.6%. Notably, this prompting performance already outperforms linear probing by +2.1% and can even match full fine-tuning on certain datasets. In addition, our prompting method shows competitive performance across different data scales and under distribution shifts. The code is publicly available at https://github.com/UCSC-VLAA/EVP.
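The two designs described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names (`compose`, `border_mask`, `normalized_step`), the toy sizes, the choice of the L2 norm for gradient normalization, and the random stand-in gradient are all assumptions made for illustration. Input diversity (random augmentation of the input before the forward pass) is omitted here.

```python
import numpy as np


def compose(prompt, image_small, pad):
    """Place the shrunk image in the centre; the prompt fills the border."""
    canvas = prompt.copy()
    h, w = image_small.shape[:2]
    canvas[pad:pad + h, pad:pad + w] = image_small
    return canvas


def border_mask(size, pad):
    """1 on the border (learnable prompt pixels), 0 where the image sits."""
    mask = np.ones((size, size, 1), dtype=np.float32)
    mask[pad:size - pad, pad:size - pad] = 0.0
    return mask


def normalized_step(prompt, grad, mask, lr):
    """One update using an L2-normalized gradient (assumed normalization)."""
    grad = grad * mask                      # only border pixels are updated
    norm = np.linalg.norm(grad)             # L2 norm over all entries
    return prompt - lr * grad / (norm + 1e-12)


# Toy sizes (hypothetical): an 8x8 canvas with a 2-pixel prompt border.
size, pad = 8, 2
rng = np.random.default_rng(0)
prompt = np.zeros((size, size, 3), dtype=np.float32)
image_small = rng.random((size - 2 * pad, size - 2 * pad, 3), dtype=np.float32)

x = compose(prompt, image_small, pad)       # input that would be fed to the model
mask = border_mask(size, pad)

# Stand-in for the loss gradient w.r.t. the prompt; a real pipeline would
# obtain this by backpropagating through the frozen model.
grad = rng.standard_normal(prompt.shape).astype(np.float32)
new_prompt = normalized_step(prompt, grad, mask, lr=0.5)
```

Because the gradient is rescaled to unit norm, the size of each update is set by `lr` alone rather than by the raw gradient magnitude.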