Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM and bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and are critical to performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to the intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.
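The idea of restricting fine-tuning to intermediate layers can be illustrated with a minimal sketch that decides, per parameter name, whether a weight would be updated or frozen. Everything specific here is an assumption for illustration rather than the setup used in this work: the choice of the middle third of the decoder blocks as "intermediate", the `layers.<i>.` naming pattern for parameters, and the helper names `intermediate_layers`, `is_trainable`, and `trainable_mask`.

```python
import re

# Assumed naming scheme: parameters of decoder block i contain "layers.<i>.".
# This pattern is an illustrative assumption, not taken from the paper.
_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def intermediate_layers(num_layers):
    """Return the block indices treated as 'intermediate'.

    The middle third is an arbitrary illustrative choice; the abstract does
    not state which layers were actually fine-tuned.
    """
    if num_layers < 3:
        raise ValueError("need at least 3 layers to define a middle third")
    first = num_layers // 3          # integer arithmetic avoids float rounding
    last = (2 * num_layers) // 3     # exclusive upper bound
    return range(first, last)


def is_trainable(param_name, layers, extra_prefixes=()):
    """True if the parameter belongs to a selected block or a listed prefix."""
    if any(param_name.startswith(p) for p in extra_prefixes):
        return True
    match = _LAYER_RE.search(param_name)
    return match is not None and int(match.group(1)) in layers


def trainable_mask(param_names, num_layers, extra_prefixes=()):
    """Map each parameter name to True (update) or False (freeze)."""
    layers = intermediate_layers(num_layers)
    return {n: is_trainable(n, layers, extra_prefixes) for n in param_names}


if __name__ == "__main__":
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.mlp.up_proj.weight",
        "model.layers.5.self_attn.q_proj.weight",
        "model.layers.11.mlp.down_proj.weight",
        "lm_head.weight",
    ]
    for name, flag in trainable_mask(names, num_layers=12).items():
        print(f"{'train ' if flag else 'freeze'}  {name}")
```

In a PyTorch setting, such a mask could be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = mask[name]`. Whether components outside the decoder blocks (for example a vision-to-text projector) should stay trainable is not stated in the abstract; the hypothetical `extra_prefixes` argument exists only so that such a choice can be expressed.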