Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Source: arXiv. Our code (https://github.com/DELTA-DoubleWise/OmniReason) and data (https://huggingface.co/datasets/ycwang11/OmniReason) are publicly available.

Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from the lack of controlled evaluation frameworks and of analyses of model internals that isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and how they are logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias the model's preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. From these observations, we identify two core failures: a task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and a fusion bottleneck, where early integration introduces bias. On further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting strategy (recognize, then reason) restores performance, confirming the task-composition bottleneck. In addition, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
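
The two-step prompting idea in the abstract (recognize, then reason) can be illustrated with a minimal sketch. This is not the authors' implementation: the `generate` callable, its signature, the prompt wording, and the function name `two_step_answer` are all hypothetical, chosen only to show how recognition and reasoning might be split into separate model calls.

```python
from typing import Callable, Dict

# Hypothetical interface (an assumption, not from the paper): a callable that
# takes a text prompt plus a dict of raw modality inputs (e.g. image, audio)
# and returns the model's text reply.
GenerateFn = Callable[[str, Dict[str, object]], str]


def two_step_answer(
    generate: GenerateFn,
    inputs: Dict[str, object],
    question: str,
) -> str:
    """Recognize first, then reason over the recognized facts as plain text."""
    # Step 1 (recognition): ask only for the facts carried by each modality.
    # The question itself is withheld so the model does not try to reason yet.
    facts = generate(
        "List every fact stated in the provided inputs, one per line. "
        "Do not answer any question yet.",
        inputs,
    )

    # Step 2 (reasoning): a text-only call. No raw modality inputs are passed,
    # so the model reasons solely over the facts transcribed in step 1.
    prompt = (
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )
    return generate(prompt, {})
```

A one-pass baseline would call `generate(question, inputs)` directly. Comparing the two on the same items is one way to check whether splitting the task changes accuracy, under the assumption that the recognition step transcribes the facts faithfully.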
