Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models differ in how information is transferred from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information. We demonstrate that ablating this single token significantly deteriorates performance on image-understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.
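To make the single-token ablation concrete, the sketch below shows one generic way such an intervention could be implemented with PyTorch forward hooks: the hidden state at one chosen sequence position (the putative "gate" token) is zeroed in the residual stream of each decoder block. This is a minimal illustration under assumptions, not the paper's actual code; `model`, `layer_list`, and `gate_token_idx` are hypothetical placeholders, and it assumes decoder blocks that return their hidden states first (as in many Transformer implementations).

```python
# Hypothetical sketch: ablating one image-token position in the residual stream
# of a decoder-only VLM via PyTorch forward hooks. Names are placeholders.
import torch

def make_ablation_hook(token_idx: int):
    """Return a hook that zeroes the hidden state at one sequence position."""
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple; hidden states are assumed to come first.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, token_idx, :] = 0.0  # knock out the single "gate" token
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

def ablate_gate_token(model, layer_list, gate_token_idx, input_ids, attention_mask):
    """Run the model while the chosen token is ablated in every listed layer."""
    handles = [layer.register_forward_hook(make_ablation_hook(gate_token_idx))
               for layer in layer_list]
    try:
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask)
    finally:
        for h in handles:
            h.remove()
    return out
```

A steering variant would replace the zeroing line with an overwrite of that position by an embedding carrying the desired semantics, rather than removing it outright.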