Training a Large Visual Language Model (LVLM) such as GPT-4 from scratch is resource-intensive. This paper proposes an alternative, LMEye, a plug-and-play Interactive Perception Network for Large Language Models (LLMs) that aims to improve the accuracy of image understanding in LVLMs. Previous methods that infuse visual information into LLMs rely on a static visual mapping network and lack dynamic interaction between the LLM and the visual information. LMEye addresses this issue by allowing the LLM to incorporate visual information that is aligned with the human instruction. Specifically, the LMEye network consists of a static visual mapping network that provides the LLM with a basic perception of the image, together with additional linear layers responsible for acquiring requests from the LLM, decomposing image features, and transmitting the interleaved information to the LLM, respectively. In this way, the LLM is in charge of understanding human instructions, sending them to the interactive perception network, and generating the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal question answering and reasoning tasks, demonstrating that it significantly improves the zero-shot performance of LLMs on multimodal tasks compared to previous methods.
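The data flow described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the dimensions, the random weights, the `frozen_llm` placeholder (which only averages its inputs), and the softmax weighting over image patches used for the "decomposing image features" step are all assumptions made for illustration, and every name is hypothetical.

```python
import math
import random

random.seed(0)

def linear(d_in, d_out):
    """Random weight matrix standing in for a trained linear layer (assumption)."""
    return [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]

def apply(w, x):
    """Matrix-vector product: project vector x with weight matrix w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

D_IMG, D_LLM = 8, 16  # toy feature sizes, chosen arbitrarily

static_map = linear(D_IMG, D_LLM)    # static visual mapping network
request_in = linear(D_LLM, D_IMG)    # acquires the request from the LLM
feedback_out = linear(D_IMG, D_LLM)  # transmits the result back to the LLM

def frozen_llm(token_embeddings):
    """Placeholder for a frozen LLM: simply averages its input embeddings."""
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]

def interactive_perception(patches, instruction_embeddings):
    # Stage 1: basic perception from a static mapping of the global image feature.
    global_feat = [sum(col) / len(patches) for col in zip(*patches)]
    basic = apply(static_map, global_feat)
    # Stage 2: the LLM reads image + instruction; its output is mapped to a request.
    request = apply(request_in, frozen_llm([basic] + instruction_embeddings))
    # Stage 3: weight image patches by relevance to the request (assumed mechanism).
    scores = [sum(r * p for r, p in zip(request, patch)) for patch in patches]
    weights = softmax(scores)
    attended = [sum(w * patch[i] for w, patch in zip(weights, patches))
                for i in range(D_IMG)]
    # Stage 4: send the instruction-conditioned visual feature back to the LLM.
    interactive = apply(feedback_out, attended)
    # Interleaved multimodal sequence from which the LLM would generate a response.
    return [basic] + instruction_embeddings + [interactive], weights

patches = [[random.random() for _ in range(D_IMG)] for _ in range(4)]
instruction = [[random.random() for _ in range(D_LLM)] for _ in range(3)]
sequence, weights = interactive_perception(patches, instruction)
```

In this sketch only the three projection matrices would be trainable, which reflects the plug-in idea of leaving the LLM itself untouched; how the actual model forms requests and combines image features may differ.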