Information Router for Mitigating Modality Dominance in Vision-Language Models

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that every modality provides sufficient information. However, attention only determines where the model focuses; it cannot supply information that is missing or ambiguous. In real-world settings, input modalities often differ in information density and signal-to-noise ratio. In such cases, adjusting the model's attention alone does not resolve the underlying lack of information. In this paper, we propose \textsc{MoIR} (\textit{Multi-modal Information Router}), an information-level fusion method that explicitly reduces information disparity prior to fusion. \textsc{MoIR} identifies less informative tokens and routes complementary information to them from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, \textsc{MoIR} enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate \textsc{MoIR} on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that \textsc{MoIR} consistently yields more balanced modality contributions and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective strategy, complementary to attention steering, for mitigating modality dominance in multi-modal reasoning models.
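
The routing idea described above (find less informative tokens, then enrich them with information from the stronger modality before the language model sees them) can be illustrated with a minimal sketch. This is not the \textsc{MoIR} implementation: the abstract does not specify how informativeness is scored or how information is routed. The sketch assumes, purely for illustration, that a token's L2 norm serves as a proxy for its informativeness and that routing is a scaled dot-product attention readout added residually. The function name `route_information` and the parameter `k` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route_information(weak, strong, k):
    """Enrich the k lowest-scoring tokens of `weak` using tokens of `strong`.

    weak:   (n_weak, d) token embeddings of the less informative modality.
    strong: (n_strong, d) token embeddings of the stronger modality.
    Returns the enriched copy of `weak` and the indices that were modified.
    """
    d = weak.shape[1]
    # ASSUMPTION: the L2 norm stands in for token informativeness.
    scores = np.linalg.norm(weak, axis=1)
    low_idx = np.argsort(scores)[:k]
    # ASSUMPTION: routing is a scaled dot-product attention readout.
    attn = softmax(weak[low_idx] @ strong.T / np.sqrt(d), axis=-1)  # (k, n_strong)
    routed = attn @ strong                                           # (k, d)
    out = weak.copy()
    out[low_idx] += routed  # residual update; other tokens stay unchanged
    return out, low_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weak = rng.normal(size=(6, 4))
    weak[[2, 5]] *= 0.01          # simulate two degraded tokens
    strong = rng.normal(size=(3, 4))
    enriched, idx = route_information(weak, strong, k=2)
    print(sorted(idx.tolist()))
```

Only the selected tokens are modified, so the sequence length and the remaining embeddings are unchanged. A learned scorer and learned projections would likely replace both assumed components in practice.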
