Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks. It uses extra learnable tokens, known as prompts, to steer a frozen pretrained ViT. Although VPT has proven applicable to supervised vision transformers, it often underperforms with self-supervised ones. Our empirical observations indicate that the effectiveness of VPT depends largely on which ViT blocks the prompt tokens interact with. Specifically, VPT performs better on image classification tasks for MAE and MoCo v3 when the prompt tokens are inserted into later blocks rather than the first block. These observations suggest that there is an optimal block location for inserting prompt tokens. Unfortunately, identifying the optimal blocks for prompts within each self-supervised ViT for diverse future scenarios is costly. To mitigate this problem, we propose a simple yet effective method that learns a gate for each ViT block to adjust its intervention in the prompt tokens. With our method, prompt tokens are selectively influenced by the blocks that require steering for task adaptation. Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation. The code is available at https://github.com/ryongithub/GatedPromptTuning.
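The per-block gate described above can be illustrated with a minimal toy sketch. It is not taken from the released code, and everything in it is an assumption made for illustration: the gate is taken to be a sigmoid of one learnable scalar per block, the gated update is taken to be a convex combination of the block's output for a prompt token and the prompt token entering that block, and the names `gated_prompt_update` and `run_blocks` are hypothetical. A real ViT block would also attend over the patch tokens, which is omitted here.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function, used to squash a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_prompt_update(prompt_in: Vector, block_out: Vector, gate_logit: float) -> Vector:
    """Blend a block's output for a prompt token with the incoming prompt token.

    Assumed form (not confirmed by the source): g * block_out + (1 - g) * prompt_in,
    where g = sigmoid(gate_logit). g near 1 lets the block fully rewrite the prompt;
    g near 0 lets the prompt pass through the block almost unchanged.
    """
    g = sigmoid(gate_logit)
    return [g * out + (1.0 - g) * inp for inp, out in zip(prompt_in, block_out)]


def run_blocks(
    prompt: Vector,
    blocks: Sequence[Callable[[Vector], Vector]],
    gate_logits: Sequence[float],
) -> Vector:
    """Pass one prompt token through a stack of toy blocks, one gate per block.

    Each block here is a stand-in callable acting on the prompt token alone.
    """
    for block, logit in zip(blocks, gate_logits):
        prompt = gated_prompt_update(prompt, block(prompt), logit)
    return prompt


if __name__ == "__main__":
    # Two toy "blocks" that double their input; gate logits of 0 give g = 0.5.
    def double(p: Vector) -> Vector:
        return [2.0 * x for x in p]

    print(run_blocks([1.0], [double, double], [0.0, 0.0]))
```

Under these assumptions, a strongly negative gate logit makes a block leave the prompt token nearly untouched, which is one way to read the abstract's statement that prompts are influenced only by the blocks that need steering. In training, the gate logits would be learned jointly with the prompt tokens while the backbone stays frozen.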