Evaluation of multimodal reasoning models is typically reduced to a single accuracy score, implicitly treating reasoning as a unitary capability. We introduce MathLens, a benchmark of textbook-style geometry problems that tests this assumption by operationally decomposing performance into Perception, Reasoning, and Integration. Each problem is derived from a symbolic specification and accompanied by visual diagrams, text-only variants, multimodal questions, and targeted perceptual probes, enabling controlled measurement of each component. Using this decomposition, we show that common training strategies induce systematically different capability profiles that are invisible under aggregate accuracy. Reinforcement learning primarily improves perceptual grounding and robustness to diagram variation, while textual SFT yields gains through reflective reasoning. In contrast, as perception and reasoning improve, a growing fraction of the remaining errors falls outside these components and is attributed to integration. These results suggest that apparent progress in multimodal reasoning reflects shifting balances among subskills rather than uniform advancement, motivating evaluation beyond scalar accuracy.