Prompt tuning is a parameter-efficient way to adapt large-scale pre-trained models to downstream tasks by adding task-specific tokens. For vision-language pre-trained (VLP) models, prompt tuning often requires a large number of learnable tokens to bridge the gap between pre-training and downstream tasks, which greatly exacerbates the already high computational overhead. In this paper, we revisit the principle of prompt tuning for Transformer-based VLP models and reveal that the impact of soft prompt tokens can actually be approximated via independent information diffusion steps, thereby avoiding expensive global attention modeling and substantially reducing the computational complexity. Based on this finding, we propose a novel Approximated Prompt Tuning (APT) approach for efficient vision-language (VL) transfer learning. To validate APT, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a range of downstream tasks. The generalization of APT is also validated on CLIP for image classification and StableDiffusion for text-to-image generation. The experimental results not only show the superior performance gains and computation efficiency of APT over conventional prompt tuning methods, e.g., +7.01% accuracy and -82.30% additional computation overhead on METER, but also confirm its merits over other parameter-efficient transfer learning approaches.
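The claim that the effect of soft prompt tokens can be handled in a separate step rests on a general property of softmax attention, which the following minimal NumPy sketch illustrates. It is not the authors' implementation: the function names, the single-head setup, and the choice to take queries from the input tokens only are assumptions made for illustration. The sketch shows only the exact identity that attention over concatenated prompt and input keys equals a gated sum of two independent attention terms. How APT then approximates the prompt term cheaply is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(q, k_in, v_in, k_p, v_p):
    """Conventional prompt tuning: attend over [prompt; input] jointly."""
    d = q.shape[-1]
    k = np.concatenate([k_p, k_in], axis=0)
    v = np.concatenate([v_p, v_in], axis=0)
    return softmax(q @ k.T / np.sqrt(d)) @ v


def split_attention(q, k_in, v_in, k_p, v_p):
    """Same result, computed as two independent attention terms plus a gate."""
    d = q.shape[-1]
    s_in = q @ k_in.T / np.sqrt(d)  # scores against input tokens
    s_p = q @ k_p.T / np.sqrt(d)    # scores against prompt tokens
    # Shared per-row shift keeps the exponentials stable.
    m = np.maximum(s_in.max(-1, keepdims=True), s_p.max(-1, keepdims=True))
    z_in = np.exp(s_in - m).sum(-1, keepdims=True)
    z_p = np.exp(s_p - m).sum(-1, keepdims=True)
    gate = z_p / (z_in + z_p)  # share of attention mass on the prompts
    input_term = softmax(s_in) @ v_in   # ordinary self-attention
    prompt_term = softmax(s_p) @ v_p    # diffusion from prompts to inputs
    return (1.0 - gate) * input_term + gate * prompt_term


# Hypothetical sizes: 6 input tokens, 4 prompt tokens, width 8.
rng = np.random.default_rng(0)
q, k_in, v_in = (rng.normal(size=(6, 8)) for _ in range(3))
k_p, v_p = (rng.normal(size=(4, 8)) for _ in range(2))
same = np.allclose(joint_attention(q, k_in, v_in, k_p, v_p),
                   split_attention(q, k_in, v_in, k_p, v_p))
```

Because the prompt term depends on the input only through the queries, it can be computed apart from the input-to-input attention, which is the part a cheaper approximation would target.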