Understanding and Improving Visual Prompting: A Label-Mapping Perspective

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (i.e., input perturbation patterns) into downstream data points. Yet it remains unclear why VP stays effective even under a ruleless label mapping (LM) between the source classes and the target classes. Motivated by this gap, we ask: How is LM interrelated with VP, and how can this relationship be exploited to improve VP's accuracy on target tasks? We examine the influence of LM on VP and answer affirmatively: a better-quality LM (assessed by mapping precision and explanation) consistently improves the effectiveness of VP. This is in contrast to prior art, in which the LM factor was overlooked. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. As highlighted below, when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets, respectively. In addition, our proposal on CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD, respectively. Our code is available at https://github.com/OPTML-Group/ILM-VP.
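
To make the re-mapping idea concrete, the following is a minimal sketch of one label-mapping step. It rests on assumptions that the abstract does not state. It assumes the mapping is one-to-one and is chosen greedily from how often the frozen source model predicts each source class for samples of each target class. The function name `greedy_label_mapping` and the toy arrays are hypothetical and are not taken from the released code.

```python
import numpy as np


def greedy_label_mapping(source_preds, target_labels, num_source, num_target):
    """Assumed greedy one-to-one mapping from target classes to source classes.

    source_preds[i]  : source class predicted for (prompted) target sample i
    target_labels[i] : ground-truth target class of sample i
    Returns an array `mapping` where mapping[t] is the source class assigned
    to target class t. Requires num_source >= num_target.
    """
    # counts[s, t] = number of class-t samples predicted as source class s
    counts = np.zeros((num_source, num_target), dtype=np.int64)
    np.add.at(counts, (source_preds, target_labels), 1)

    mapping = np.full(num_target, -1, dtype=np.int64)
    work = counts.astype(np.float64)
    for _ in range(num_target):
        # Pick the most frequent remaining (source, target) pair.
        s, t = np.unravel_index(np.argmax(work), work.shape)
        mapping[t] = s
        work[s, :] = -1.0  # source class s is now taken
        work[:, t] = -1.0  # target class t is now assigned
    return mapping


# Toy data (hypothetical): 6 target samples, 10 source classes, 3 target classes.
source_preds = np.array([5, 5, 2, 2, 2, 7])
target_labels = np.array([0, 0, 1, 1, 0, 2])
mapping = greedy_label_mapping(source_preds, target_labels, num_source=10, num_target=3)
print(mapping.tolist())
```

In an iterative scheme of the kind the abstract describes, a step like this would be recomputed after each round of prompt updates, so that the mapping and the prompt are refined in alternation. The exact schedule and mapping rule used by ILM-VP are not specified here and should be checked against the released code.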
