Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
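To make the experimental setup concrete: the abstract does not specify the exact construction of the controlled synthetic retrieval task, but a minimal sketch, assuming a key-value lookup format where the in-distribution split queries only a restricted range of pair positions and the OOD split queries positions unseen during text-only training, could look like the following. The function `make_example` and the `query_positions` parameter are illustrative names, not the paper's actual code.

```python
# Hypothetical sketch of a controlled synthetic retrieval task (the paper's
# exact construction is not given in the abstract). Each example is a sequence
# of key-value pairs followed by a query key; the target is the value bound to
# that key. The ID split restricts which pair positions are queried, while the
# OOD split queries positions never seen in training, so a model relying on a
# positional shortcut fails OOD while symbolic key-value binding succeeds.
import random

def make_example(num_pairs: int, query_positions: range, vocab_size: int = 100):
    """Build one retrieval example as (context_tokens, query_key, answer)."""
    keys = random.sample(range(vocab_size), num_pairs)         # distinct keys
    values = [random.randrange(vocab_size) for _ in keys]      # arbitrary values
    context = [tok for kv in zip(keys, values) for tok in kv]  # k1 v1 k2 v2 ...
    q = random.choice(query_positions)                         # which pair to query
    return context, keys[q], values[q]

# In-distribution split: only the first half of positions are ever queried.
id_example = make_example(num_pairs=8, query_positions=range(0, 4))
# OOD split: queries target pair positions unseen during text-only training.
ood_example = make_example(num_pairs=8, query_positions=range(4, 8))
```

Under this construction, "image-tokenized" training would render the same context as an image, where translation invariance in the visual encoder breaks the correlation between answer identity and absolute position.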