Training a Large Visual Language Model (LVLM) such as GPT-4 from scratch is resource-intensive. This paper presents a plug-and-play module for Large Language Models (LLMs), the Interactive Perception Network (IPN), which aims to build an LVLM by adding image understanding capability to an existing LLM. Previous methods incorporate visual information into LLMs through a simple visual mapping network, in which a linear layer projects the image feature into the LLM's embedding space. Such a mapping network projects the image feature only once and does not consider the interaction between the image and the human input query. The resulting visual information, which we term static visual information, has no connection to human intention and may therefore be inadequate for LLMs to generate intention-following responses. IPN addresses this issue by allowing the LLM to request the visual information that matches a given human instruction, which we term dynamic interaction between the LLM and visual information. Specifically, IPN contains a simple visual mapping network that gives the LLM a basic perception of an image. It also contains three additional modules responsible, respectively, for acquiring requests from the LLM, performing request-based visual information interaction, and transmitting the resulting visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates the response based on the interleaved multimodal information. We evaluate IPN through extensive experiments on tasks including multimodal question answering and reasoning, and the results show that it significantly improves the zero-shot performance of LVLMs on various multimodal tasks compared with previous methods.
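The contrast between static visual information and request-based interaction can be illustrated with a small sketch. This is not the IPN implementation: the abstract does not specify the interaction mechanism, so the cross-attention used here, the function names (`static_mapping`, `request_based_interaction`), the weight matrices, and the toy dimensions are all assumptions chosen for illustration. The `request` vector stands in for whatever representation the LLM would emit.

```python
import math

def linear(x, w):
    """Project vector x (length d_in) with weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def static_mapping(patch_feats, w_map):
    """Query-independent projection of each image patch into the LLM
    embedding space (the 'simple visual mapping network')."""
    return [linear(p, w_map) for p in patch_feats]

def request_based_interaction(request, patch_feats, w_req, w_out):
    """Hypothetical interaction step: the request attends over the image
    patches (assumed cross-attention), and the pooled result is projected
    back into the LLM embedding space."""
    q = linear(request, w_req)                      # request -> image feature space
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, p)) / scale for p in patch_feats]
    weights = softmax(scores)                       # one weight per patch
    pooled = [sum(w * p[k] for w, p in zip(weights, patch_feats))
              for k in range(len(q))]
    return linear(pooled, w_out), weights

# Toy setup: two image patches with 2-d features; identity weights throughout.
identity = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0]]

# Static tokens are the same no matter what the user asks.
static_tokens = static_mapping(patches, identity)

# Two different requests yield different attention weights and visual tokens.
token_a, weights_a = request_based_interaction([4.0, 0.0], patches, identity, identity)
token_b, weights_b = request_based_interaction([0.0, 4.0], patches, identity, identity)
```

In this toy setting the first request places most of its weight on the first patch and the second request on the second patch, while `static_tokens` does not depend on either request. A real system would use learned weights and the LLM's own hidden states as requests.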