Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare native multimodal VLMs (models trained from scratch on multimodal data to generate both text and images) with non-native multimodal VLMs (models adapted from pre-trained large language models or capable of generating only text), highlighting key differences in information flow. We find that in native multimodal VLMs, image and text embeddings are more clearly separated within the residual stream. Moreover, VLMs differ in how visual information reaches the text: non-native multimodal VLMs exhibit a distributed communication pattern, in which information is exchanged through multiple image tokens, whereas models trained natively for joint image and text generation tend to rely on a single post-image token that acts as a narrow gate for visual information. We show that ablating this single token substantially degrades image-understanding performance, whereas targeted, token-level interventions reliably steer image semantics and downstream text with fine-grained control.
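To make the ablation concrete, the sketch below shows one common way such a token-level ablation can be implemented: a PyTorch forward hook that zeroes the residual-stream activation at a single token position in a decoder layer. This is a minimal illustration under assumptions, not the paper's exact procedure; the module path `model.language_model.layers[k]` and the variable `post_image_pos` are hypothetical placeholders for the layer and the post-image token position in a given VLM.

```python
import torch


def make_token_ablation_hook(position: int):
    """Return a forward hook that zeroes the hidden state at one token position.

    Assumes the hooked module returns either a hidden-state tensor of shape
    (batch, seq_len, hidden_dim) or a tuple whose first element has that shape,
    as is typical for decoder layers in transformers-style implementations.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()                     # avoid in-place edits on autograd buffers
        hidden[:, position, :] = 0.0                # ablate the single token's residual stream
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook


# Hypothetical usage: ablate the post-image token at layer k during generation.
# k, post_image_pos, model, and inputs are assumed to be defined for the VLM at hand.
# handle = model.language_model.layers[k].register_forward_hook(
#     make_token_ablation_hook(post_image_pos)
# )
# outputs = model.generate(**inputs)
# handle.remove()                                   # restore the unmodified forward pass
```

A steering-style intervention follows the same pattern: instead of zeroing the activation, one would add or substitute a direction in the residual stream at that position before returning the modified hidden state.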