As transformer-based models continue to grow, fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning methods have been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, a significant performance gap remains compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts into the self-attention layers and a set of learnable visual prompts into the input layers to improve the effectiveness of fine-tuning. Moreover, we design a prompt pruning procedure that systematically removes low-importance prompts while preserving model performance, which substantially improves the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks while tuning only a small fraction of the parameters (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.
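To make the two ideas above concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes that the key-value prompts are concatenated to the key and value matrices of a single-head self-attention layer, and that pruning keeps the prompts with the highest importance scores. The function names (`attention_with_kv_prompts`, `prune_prompts`), the `keep_ratio` parameter, and the importance scores themselves are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_kv_prompts(x, w_q, w_k, w_v, p_k, p_v):
    """Single-head self-attention with prompts prepended to keys and values.

    x:        (n, d) input tokens (frozen backbone features)
    w_q/k/v:  (d, d) frozen projection matrices
    p_k, p_v: (m, d) learnable key and value prompts (assumed placement)
    """
    q = x @ w_q
    k = np.concatenate([p_k, x @ w_k], axis=0)   # (m + n, d)
    v = np.concatenate([p_v, x @ w_v], axis=0)   # (m + n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n, m + n)
    return softmax(scores) @ v                   # (n, d): sequence length unchanged


def prune_prompts(prompts, importance, keep_ratio):
    """Keep the highest-importance prompts (hypothetical scoring)."""
    n_keep = max(1, int(round(keep_ratio * len(prompts))))
    keep = np.sort(np.argsort(importance)[-n_keep:])  # preserve original order
    return prompts[keep]


rng = np.random.default_rng(0)
n, m, d = 4, 3, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
p_k, p_v = rng.normal(size=(m, d)), rng.normal(size=(m, d))

out = attention_with_kv_prompts(x, w_q, w_k, w_v, p_k, p_v)
importance = np.array([0.1, 0.9, 0.5])           # made-up scores for illustration
p_k_pruned = prune_prompts(p_k, importance, keep_ratio=2 / 3)
```

Because the prompts only extend the keys and values, the output keeps one row per input token, so the layer can be dropped into a frozen backbone without changing downstream shapes. Only `p_k` and `p_v` would be trained in this sketch.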