Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus on result-oriented performance while neglecting the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning processes. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively mitigated via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly shifted from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts, yet fail to answer the corresponding sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
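The abstract names the four diagnostic categories and explicitly defines Rote Memorization as solving a composite problem while failing its sub-problems. The sketch below illustrates one plausible way such a four-way classification could be computed from per-problem correctness; the function name and the exact criteria for IK, IG, and CM are assumptions inferred from the category names, not the paper's official definitions.

```python
def classify_outcome(composite_correct: bool, subproblem_correct: list[bool]) -> str:
    """Illustrative four-way classification of an LMM's result on one
    composite problem and its decomposed sub-problems (assumed criteria).
    """
    all_subs = all(subproblem_correct)
    if composite_correct and all_subs:
        return "CM"  # Complete Mastery: everything solved
    if composite_correct and not all_subs:
        # Matches the abstract's definition of Rote Memorization:
        # the composite problem is solved, but sub-problems are not.
        return "RM"
    if not composite_correct and all_subs:
        return "IG"  # Inadequate Generalization: knows the pieces, fails to combine
    return "IK"      # Insufficient Knowledge: errors already in the sub-problems
```

Under this reading, a model that "knows" each concept but cannot compose them lands in IG, while a model that pattern-matches the composite answer without underlying knowledge lands in RM.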