Training a Large Visual Language Model (LVLM) such as GPT-4 from scratch is resource-intensive. This paper presents a plug-and-play module for Large Language Models (LLMs), the Interactive Perception Network (IPN), which aims to build an LVLM by adding image understanding capability to an LLM. Previous methods incorporate visual information into LLMs through a simple visual mapping network, in which the image feature is projected into the LLM embedding space via a linear layer. Such a mapping network projects the image feature only once and does not consider the interaction between the image and the human input query. The resulting visual information, which we term static visual information, has no connection to human intention and may therefore be inadequate for LLMs to generate intention-following responses. IPN addresses this issue by allowing the LLM to request the visual information it needs for a given human instruction, which we term dynamic interaction between the LLM and visual information. Specifically, IPN contains a simple visual mapping network that provides the LLM with a basic perception of the image. It also contains additional modules responsible, respectively, for acquiring requests from the LLM, performing request-based visual information interaction, and transmitting the resulting visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates its response based on the interleaved multimodal information. We evaluate IPN through extensive experiments on multimodal question answering, reasoning, and other tasks, demonstrating that it significantly improves the zero-shot performance of LVLMs on various multimodal tasks compared to previous methods.
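The contrast between static and request-based visual information can be illustrated with a minimal NumPy sketch. The abstract does not specify how IPN's modules are implemented, so everything below is an assumption made for illustration: the request is modeled as a linear projection of an LLM hidden state, the interaction as single-head dot-product attention over image patch features, and the function names, weight matrices, and dimensions are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def static_visual_tokens(patch_feats, w_map):
    """Simple visual mapping: one linear projection of the image features
    into the LLM embedding space, independent of the human query."""
    return patch_feats @ w_map  # (n_patches, d_llm)


def request_based_visual_token(patch_feats, llm_hidden, w_req, w_out):
    """Hypothetical request-based interaction (an assumption, not the
    paper's confirmed design):
    1. acquire a request vector from an LLM hidden state,
    2. attend over the image patches with that request,
    3. project the attended feature back into the LLM embedding space."""
    request = llm_hidden @ w_req                        # (d_v,)
    scores = patch_feats @ request / np.sqrt(patch_feats.shape[1])
    weights = softmax(scores)                           # (n_patches,)
    attended = weights @ patch_feats                    # (d_v,)
    return attended @ w_out, weights                    # (d_llm,), (n_patches,)


# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_patches, d_v, d_llm = 16, 32, 64
patch_feats = rng.normal(size=(n_patches, d_v))
w_map = rng.normal(size=(d_v, d_llm)) * 0.1
w_req = rng.normal(size=(d_llm, d_v)) * 0.1
w_out = rng.normal(size=(d_v, d_llm)) * 0.1

# Two different LLM hidden states stand in for two different human queries.
hidden_a = rng.normal(size=d_llm)
hidden_b = rng.normal(size=d_llm)

static = static_visual_tokens(patch_feats, w_map)
tok_a, attn_a = request_based_visual_token(patch_feats, hidden_a, w_req, w_out)
tok_b, attn_b = request_based_visual_token(patch_feats, hidden_b, w_req, w_out)
```

In this sketch, `static` is identical for every query, whereas `tok_a` and `tok_b` depend on the LLM state and so differ between queries; this is the sense in which the request-based path is dynamic.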