Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Pre-trained language models (PLMs) play an increasingly important role in multimedia research. In vision-language (VL) tasks, they usually serve as a language encoder and still require an additional fusion network for VL reasoning, which results in excessive memory overhead. In this paper, we explore PLMs as stand-alone models for VL reasoning tasks. Inspired by the recently popular prompt tuning, we first show that processed visual features can also be projected onto the semantic space of PLMs and act as prompt tokens, bridging the gap between single- and multi-modal learning. However, this solution exhibits obvious redundancy in visual information and model inference, and the placement of the prompt tokens greatly affects the final performance. Based on these observations, we further propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP). Concretely, DVP first deploys a cross-attention module to obtain text-related and compact visual prompt tokens, thereby greatly reducing the input length of PLMs. To obtain the optimal placement, we also equip DVP with a reinforcement-learning-based search algorithm, which can automatically merge DVP with PLMs for different VL tasks via a very short search process. In addition, we combine DVP with the recently popular adapter approach to keep most of the parameters of PLMs intact when adapting to VL tasks, helping PLMs switch quickly between single- and multi-modal tasks. We apply DVP to two representative PLMs, namely BERT and T5, and conduct extensive experiments on a set of VL reasoning benchmarks including VQA2.0, GQA and SNLI-VE. The experimental results not only show the advantage of DVP in efficiency and performance, but also confirm its superiority in adapting pre-trained language models to VL tasks.
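
To make the cross-attention step more concrete, the following is a minimal NumPy sketch of how a text-derived query could attend over many visual features to produce a small number of visual prompt tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single-head attention, the mean-pooled text query, the random projection matrices, and all dimensions are hypothetical, and the prompts are simply prepended to the text tokens, whereas the abstract states that the actual placement is found by a search.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_visual_prompts(text_query, visual_feats, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, not from the paper).

    text_query:   (num_prompts, d_text) queries derived from the text
    visual_feats: (num_regions, d_vis)  processed visual features
    returns:      (num_prompts, d_model) compact, text-related prompt tokens
    """
    q = text_query @ w_q            # project text queries
    k = visual_feats @ w_k          # project visual keys
    v = visual_feats @ w_v          # project visual values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_prompts, num_regions)
    return attn @ v


def insert_prompts(token_states, prompts, position=0):
    """Insert prompt tokens into a token sequence at an assumed position."""
    return np.concatenate(
        [token_states[:position], prompts, token_states[position:]], axis=0
    )


rng = np.random.default_rng(0)
d_text, d_vis, d_model = 16, 32, 16       # assumed sizes; d_model == d_text here

text_tokens = rng.normal(size=(12, d_text))     # 12 text token embeddings
visual_feats = rng.normal(size=(100, d_vis))    # 100 visual region/patch features

# Assumption: one query obtained by mean-pooling the text tokens.
text_query = text_tokens.mean(axis=0, keepdims=True)

# Randomly initialised projections stand in for learned weights.
w_q = rng.normal(size=(d_text, d_model)) * 0.1
w_k = rng.normal(size=(d_vis, d_model)) * 0.1
w_v = rng.normal(size=(d_vis, d_model)) * 0.1

prompts = dynamic_visual_prompts(text_query, visual_feats, w_q, w_k, w_v)
fused = insert_prompts(text_tokens, prompts, position=0)

print(prompts.shape, fused.shape)  # (1, 16) (13, 16)
```

In this toy setting the language model would receive 13 tokens instead of the 112 it would see if all 100 visual features were projected and concatenated directly, which is the kind of input-length reduction the abstract describes.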
