Training a Multimodal Large Language Model (MLLM) such as GPT-4 from scratch is resource-intensive. Treating Large Language Models (LLMs) as the core processor for multimodal information, our paper introduces LMEye, a human-like eye with a plug-and-play interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. Previous methods incorporate visual information into LLMs through a simple visual mapping network or the Q-Former from BLIP-2. Such networks project the image features only once and do not consider the interaction between the image and the human input query. The resulting visual information, which we refer to as static visual information, is not connected to human intention and may therefore be inadequate for LLMs to generate intention-following responses. LMEye addresses this issue by allowing the LLM to request the visual information it needs for a given human instruction, which we term dynamic visual information interaction. Specifically, LMEye consists of a simple visual mapping network that provides the LLM with a basic perception of an image, together with additional modules responsible, respectively, for acquiring requests from the LLM, performing request-based visual information interaction, and transmitting the resulting visual information back to the LLM. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on several multimodal benchmarks, demonstrating that it significantly improves zero-shot performance on various multimodal tasks compared to previous methods, while using fewer parameters.
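The data flow described in the abstract (static mapping, request acquisition, request-based interaction, transmission back to the LLM) can be illustrated with a toy sketch. This is not the authors' implementation: the class `LMEyeSketch`, the helper `toy_llm_request`, the dimensions, and the random weights are all hypothetical, and single-head dot-product attention is used here only as an assumed stand-in for the request-based interaction module, whose actual design the abstract does not specify.

```python
import math
import random


def linear(x, w):
    """Project vector x (length n) with matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random placeholder weights; a trained model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LMEyeSketch:
    """Hypothetical toy model of the static and dynamic visual paths."""

    def __init__(self, d_img, d_llm, seed=0):
        rng = random.Random(seed)
        self.w_map = rand_matrix(d_img, d_llm, rng)  # static visual mapping
        self.w_req = rand_matrix(d_llm, d_img, rng)  # request acquisition
        self.w_out = rand_matrix(d_img, d_llm, rng)  # transmission to the LLM

    def static_visual(self, patches):
        """Project mean-pooled image features once, ignoring the query."""
        n = len(patches)
        pooled = [sum(p[k] for p in patches) / n for k in range(len(patches[0]))]
        return linear(pooled, self.w_map)

    def dynamic_visual(self, patches, llm_request):
        """Attend over image patches using the request emitted by the LLM."""
        query = linear(llm_request, self.w_req)
        scale = math.sqrt(len(query))
        scores = [sum(q * v for q, v in zip(query, p)) / scale for p in patches]
        weights = softmax(scores)
        attended = [sum(w * p[k] for w, p in zip(weights, patches))
                    for k in range(len(query))]
        return linear(attended, self.w_out), weights


def toy_llm_request(static_vec, query_vec):
    """Hypothetical stand-in for the LLM: fuse basic perception and the query."""
    return [s + q for s, q in zip(static_vec, query_vec)]


# Three image patches with 3-dimensional features (illustrative values).
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eye = LMEyeSketch(d_img=3, d_llm=4)
static = eye.static_visual(patches)
request = toy_llm_request(static, [1.0, 0.0, 0.0, 0.0])
dynamic, weights = eye.dynamic_visual(patches, request)
print("static:", static)
print("attention over patches:", weights)
print("dynamic:", dynamic)
```

In this sketch the static vector is the same for every query, while the attention weights and the dynamic vector change with the request, which is the distinction the abstract draws between static visual information and dynamic visual information interaction.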