Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models differ in how information is transferred from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information. We demonstrate that ablating this single token significantly deteriorates performance on image-understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.
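To make the single-token ablation concrete, the sketch below shows one generic way such an intervention could be implemented with PyTorch forward hooks: the hidden state at one chosen sequence position (the putative "gate" token) is zeroed in the residual stream of each decoder block. This is a minimal illustration under assumptions, not the paper's actual code; `model`, `layer_list`, and `gate_token_idx` are hypothetical placeholders, and it assumes decoder blocks that return their hidden states first (as in many Transformer implementations).

```python
# Hypothetical sketch: ablating one image-token position in the residual stream
# of a decoder-only VLM via PyTorch forward hooks. Names are placeholders.
import torch

def make_ablation_hook(token_idx: int):
    """Return a hook that zeroes the hidden state at one sequence position."""
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple; hidden states are assumed to come first.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, token_idx, :] = 0.0  # knock out the single "gate" token
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

def ablate_gate_token(model, layer_list, gate_token_idx, input_ids, attention_mask):
    """Run the model while the chosen token is ablated in every listed layer."""
    handles = [layer.register_forward_hook(make_ablation_hook(gate_token_idx))
               for layer in layer_list]
    try:
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask)
    finally:
        for h in handles:
            h.remove()
    return out
```

A steering variant would replace the zeroing line with an overwrite of that position by an embedding carrying the desired semantics, rather than removing it outright.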