LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have made substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unchanged. In practice, however, such modality substitution causes a sharp drop in performance. We attribute this "carrier sensitivity" to an inherent bias in current training corpora. Across prevalent data sources such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically assigned distinct and asymmetric roles, with text serving as the linguistic query and images as the visual reference. This bias leads VLMs to acquire information differently depending on the modality that carries it. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, which makes their reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm that supervises cross-modal representational invariance between semantically equivalent text and image carriers. LoMo reformulates single-modality prompts into interleaved multimodal sequences: it dynamically selects target text spans and recasts them as rendered images, preserving the same semantics across a "text, visual, text" sequence of carriers. Extensive experiments on 13 diverse multimodal benchmarks show that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundation models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
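The span-substitution step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Segment`, `substitute_span`, and `placeholder_render`, the word-level span granularity, the uniform random span choice, and the `min_words`/`max_words` bounds are all assumptions made for illustration. The renderer is a stand-in that only records the text that would be drawn; a real pipeline would replace it with an actual text-to-image rendering function.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Segment:
    kind: str     # "text" or "image"
    content: Any  # str for text; the renderer's output for image


def placeholder_render(text: str) -> dict:
    # Hypothetical stand-in for a text-to-image renderer: it only records
    # the text that a real renderer would draw onto an image.
    return {"rendered_text": text}


def substitute_span(
    prompt: str,
    rng: random.Random,
    min_words: int = 1,
    max_words: int = 8,
    render: Callable[[str], Any] = placeholder_render,
) -> List[Segment]:
    """Turn a text-only prompt into a text / image / text sequence.

    One contiguous span of words is chosen at random (an assumption; the
    source only says spans are selected dynamically) and passed to `render`.
    """
    words = prompt.split()
    if not words:
        return [Segment("text", prompt)]

    # Clamp the span length so it always fits inside the prompt.
    longest = min(max_words, len(words))
    shortest = min(min_words, longest)
    span_len = rng.randint(shortest, longest)
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len

    segments: List[Segment] = []
    if start > 0:
        segments.append(Segment("text", " ".join(words[:start])))
    segments.append(Segment("image", render(" ".join(words[start:end]))))
    if end < len(words):
        segments.append(Segment("text", " ".join(words[end:])))
    return segments
```

Calling `substitute_span("What is the area of a circle with radius 3?", random.Random(0))` returns a list with exactly one image segment and up to two surrounding text segments. Concatenating the segment contents in order recovers the original wording, which is the sense in which the semantics are preserved while the carrier changes.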
