Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address these limitations, we introduce StepSTEM, a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
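As an illustration of the step-level idea described above, the following is a minimal sketch of aligning predicted reasoning steps with several reference solutions via dynamic programming. It is not the StepSTEM implementation: the token-overlap similarity `jaccard`, the `threshold` parameter, the normalization by reference length, and the function names `align_score` and `best_reference_score` are all assumptions made for this example. The sketch uses an order-preserving (LCS-style) alignment and keeps the best score over the available references.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Toy step similarity (assumption): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def align_score(
    pred: Sequence[str],
    ref: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> float:
    """Order-preserving alignment of predicted steps to one reference.

    dp[i][j] holds the best total similarity obtainable from the first i
    predicted steps and the first j reference steps, where each step is
    matched at most once and matches may not cross. The result is
    normalized by the number of reference steps (an assumption).
    """
    n, m = len(pred), len(ref)
    if m == 0:
        return 0.0
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim(pred[i - 1], ref[j - 1])
            # Only pairs that are similar enough count as a match.
            match = dp[i - 1][j - 1] + s if s >= threshold else 0.0
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], match)
    return dp[n][m] / m


def best_reference_score(
    pred: Sequence[str],
    refs: Sequence[Sequence[str]],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> float:
    """Score a prediction against multiple references and keep the best."""
    return max((align_score(pred, r, sim, threshold) for r in refs), default=0.0)


if __name__ == "__main__":
    predicted = [
        "compute the force",
        "apply newton second law",
        "solve for acceleration",
    ]
    references = [list(reversed(predicted)), list(predicted)]
    print(best_reference_score(predicted, references))
```

Taking the maximum over references means a prediction is not penalized for following one valid solution path rather than another. In practice the toy `jaccard` function would be replaced by a stronger step matcher, such as an embedding-based or model-based judge.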