With the rapid progress of Multimodal Large Language Models (MLLMs), evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an MLLM's ability to comprehend and solve complex, multi-step quantitative problems. While existing benchmarks such as MathVista and MathVerse have advanced the evaluation of multimodal math proficiency, they rely primarily on digitally rendered content and fall short of capturing the complexity of real-world scenarios. To bridge this gap, we introduce MathScape, a novel benchmark for assessing MLLMs' reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images, closely reflecting the challenges encountered in practical educational settings. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs. Our results show that even state-of-the-art models struggle with real-world math tasks and lag behind human performance, highlighting critical limitations in current model capabilities. Moreover, we find that strong performance on synthetic or digitally rendered images does not guarantee comparable effectiveness on real-world tasks. This underscores the necessity of MathScape for the next stage of multimodal mathematical reasoning.