Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders into the input space of pre-trained language models as visual prompts, and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm remains inefficient because it significantly increases the input length of the language models. In this paper, instead of integrating visual prompts into the inputs, we regard visual prompts as additional knowledge that helps language models address tasks involving visual information. Motivated by the finding that the Feed-Forward Network (FFN) of a language model acts as "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection. Experimental results across various VL tasks and language models show that MemVP significantly reduces both the training time and the inference latency of the fine-tuned VL models, and surpasses the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP
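The key-value-memory view of the FFN, and how visual prompts can be injected into it rather than into the input sequence, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the projection of visual features, and the use of the same features as both keys and values are assumptions made for brevity.

```python
import numpy as np

# A Transformer FFN can be read as key-value memory:
#   FFN(x) = act(x @ W_k) @ W_v
# MemVP, as described in the abstract, concatenates projected visual
# features to the FFN weights (extra key/value "memory slots"), so
# visual knowledge is injected WITHOUT lengthening the input sequence.
# All shapes below are illustrative assumptions.

d, m, n_vis = 8, 32, 4  # hidden dim, FFN hidden dim, visual feature count
rng = np.random.default_rng(0)

W_k = rng.standard_normal((d, m))      # original FFN first-layer weights ("keys")
W_v = rng.standard_normal((m, d))      # original FFN second-layer weights ("values")
vis = rng.standard_normal((n_vis, d))  # projected visual features (hypothetical)

def relu(z):
    return np.maximum(z, 0.0)

def memvp_ffn(x):
    # Append visual features as extra key and value slots.
    K = np.concatenate([W_k, vis.T], axis=1)  # (d, m + n_vis)
    V = np.concatenate([W_v, vis], axis=0)    # (m + n_vis, d)
    return relu(x @ K) @ V

x = rng.standard_normal((5, d))  # 5 text tokens; sequence length is unchanged
out = memvp_ffn(x)
print(out.shape)  # (5, 8)
```

Note that the output has the same sequence length as the text input, whereas input-space visual prompting would prepend the visual tokens and lengthen every forward pass.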