To bridge the gap between the vision and language modalities, Multimodal Large Language Models (MLLMs) typically learn an adapter that converts visual inputs into tokens that Large Language Models (LLMs) can understand. However, most adapters produce the same visual tokens regardless of which objects of interest the prompt mentions. Because these adapters attend equally to every detail in the image and encode the entire scene, they can increase the cognitive load on LLMs, particularly when processing complex scenes. To alleviate this problem, we propose prompt-aware adapters, which dynamically embed visual inputs according to the specific focus of the prompt. Concretely, prompt-aware adapters exploit both global and local textual features to capture the visual cues most relevant to the prompt at both coarse and fine granularity levels. This approach significantly improves the ability of LLMs to understand and interpret visual content. Experiments on various visual question answering tasks, such as counting and position reasoning, demonstrate the effectiveness of prompt-aware adapters.
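The core idea can be illustrated with a minimal sketch: instead of pooling visual tokens uniformly, the adapter forms queries from the prompt's global sentence feature (coarse granularity) and local token features (fine granularity), then cross-attends over the visual tokens so the output emphasizes prompt-relevant regions. This is an illustrative toy implementation, not the paper's actual architecture; the function name, shapes, and the use of a single unparameterized attention step are all assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_aware_adapter(visual, text_local, text_global):
    """Toy prompt-conditioned adapter (illustrative, not the paper's model).

    visual:      (Nv, d) visual tokens from the image encoder
    text_local:  (Nt, d) per-token prompt features (fine granularity)
    text_global: (d,)    pooled prompt feature (coarse granularity)
    Returns (1 + Nt, d) prompt-conditioned visual tokens for the LLM.
    """
    d = visual.shape[1]
    # Queries come from the prompt (global + local); keys/values are visual tokens,
    # so each output row is a prompt-weighted mixture of image regions.
    queries = np.vstack([text_global[None, :], text_local])   # (1 + Nt, d)
    attn = softmax(queries @ visual.T / np.sqrt(d))           # (1 + Nt, Nv)
    return attn @ visual

# Tiny smoke test with random features.
rng = np.random.default_rng(0)
d, n_v, n_t = 16, 8, 4
visual = rng.normal(size=(n_v, d))
tokens = prompt_aware_adapter(visual,
                              rng.normal(size=(n_t, d)),
                              rng.normal(size=d))
print(tokens.shape)  # (5, 16)
```

Because the attention weights depend on the prompt features, two different questions about the same image yield different visual tokens, which is the behavior the abstract contrasts with prompt-agnostic adapters.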