Parameter-efficient transfer learning (PETL) is an emerging research topic that aims to adapt large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in reducing storage and computation costs. However, these methods do not take into account instance-specific visual clues in visual tasks. In this paper, we propose a Dynamic Visual Prompt Tuning framework (DVPT), which generates a dynamic instance-wise token for each image. In this way, it captures the unique visual features of each image, making it better suited to downstream visual tasks. We design a Meta-Net module that generates learnable prompts conditioned on each image, thereby capturing dynamic instance-wise visual features. Extensive experiments on a wide range of downstream recognition tasks show that DVPT outperforms other PETL methods. More importantly, DVPT even outperforms full fine-tuning on 17 out of 19 downstream tasks while maintaining high parameter efficiency. Our code will be released soon.
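The idea of an instance-wise prompt can be illustrated with a minimal sketch. The abstract does not specify the Meta-Net architecture, so everything below is an assumption made for illustration only: a two-layer bottleneck MLP with a tanh activation, mean pooling over patch tokens, a single prompt token prepended to the sequence, and the hypothetical names `MetaNet`, `init_linear`, and `add_dynamic_prompt`. It is written in plain Python rather than a deep learning framework to stay self-contained, and it is not the authors' implementation.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MetaNet:
    """Assumed bottleneck MLP: pooled patch features -> one prompt token."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(dim, hidden, rng)
        self.w2, self.b2 = init_linear(hidden, dim, rng)

    def __call__(self, patch_tokens):
        # Mean-pool the patch tokens into one feature vector per image.
        n = len(patch_tokens)
        pooled = [sum(col) / n for col in zip(*patch_tokens)]
        hidden = [math.tanh(h) for h in linear(pooled, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


def add_dynamic_prompt(patch_tokens, meta_net):
    """Prepend the image-conditioned prompt token to the patch sequence."""
    return [meta_net(patch_tokens)] + patch_tokens


# Two random "images", each given as 4 patch tokens of width 8.
rng = random.Random(1)
dim = 8
image_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
image_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

meta_net = MetaNet(dim=dim, hidden=4)
seq_a = add_dynamic_prompt(image_a, meta_net)
seq_b = add_dynamic_prompt(image_b, meta_net)
print(len(seq_a), len(seq_a[0]))  # sequence length and token width
```

In a PETL setting, only the small Meta-Net (and typically a task head) would be trained while the pre-trained backbone stays frozen. Because the prompt is computed from the image itself, `seq_a[0]` and `seq_b[0]` differ between inputs, unlike a static prompt that is shared across all images.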