Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders into the input space of pre-trained language models as visual prompts, and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm remains inefficient because it significantly increases the input length of the language models. In this paper, instead of integrating visual prompts into the inputs, we regard visual prompts as additional knowledge that helps language models address tasks involving visual information. Motivated by the finding that the Feed-Forward Network (FFN) of a language model acts as "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection. Experimental results across various VL tasks and language models show that MemVP significantly reduces both the training time and the inference latency of the fine-tuned VL models, and surpasses the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP
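The key-value-memory view of the FFN, and how visual prompts can be injected into it rather than into the input sequence, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the projection of visual features, and the use of the same features as both keys and values are assumptions made for brevity.

```python
import numpy as np

# A Transformer FFN can be read as key-value memory:
#   FFN(x) = act(x @ W_k) @ W_v
# MemVP, as described in the abstract, concatenates projected visual
# features to the FFN weights (extra key/value "memory slots"), so
# visual knowledge is injected WITHOUT lengthening the input sequence.
# All shapes below are illustrative assumptions.

d, m, n_vis = 8, 32, 4  # hidden dim, FFN hidden dim, visual feature count
rng = np.random.default_rng(0)

W_k = rng.standard_normal((d, m))      # original FFN first-layer weights ("keys")
W_v = rng.standard_normal((m, d))      # original FFN second-layer weights ("values")
vis = rng.standard_normal((n_vis, d))  # projected visual features (hypothetical)

def relu(z):
    return np.maximum(z, 0.0)

def memvp_ffn(x):
    # Append visual features as extra key and value slots.
    K = np.concatenate([W_k, vis.T], axis=1)  # (d, m + n_vis)
    V = np.concatenate([W_v, vis], axis=0)    # (m + n_vis, d)
    return relu(x @ K) @ V

x = rng.standard_normal((5, d))  # 5 text tokens; sequence length is unchanged
out = memvp_ffn(x)
print(out.shape)  # (5, 8)
```

Note that the output has the same sequence length as the text input, whereas input-space visual prompting would prepend the visual tokens and lengthen every forward pass.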