To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle-school math problems with visual contexts, classified at fine granularity by difficulty, grade level, and knowledge point. Unlike existing benchmarks, which rely on binary answer comparison, MM-MATH incorporates both outcome and process evaluation. The process evaluation uses an LMM-as-a-judge to automatically analyze solution steps, identifying errors and categorizing them into specific error types. An extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs: they make limited use of visual information and struggle with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared with 82% for humans, underscoring both the difficulty of the benchmark and the substantial gap between the multimodal reasoning capabilities of current models and humans. The process evaluation further shows that diagram misinterpretation is the most common error, accounting for more than half of all error cases, highlighting the need for improved image comprehension in multimodal reasoning.
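To make the process-evaluation idea concrete, below is a minimal sketch of how an LMM-as-a-judge step analysis could be wired up. The judge model, prompt wording, and error-category names are illustrative assumptions for exposition, not the paper's actual pipeline or taxonomy.

```python
# Minimal sketch (not the authors' implementation) of LMM-as-a-judge
# process evaluation: the judge receives the problem, a reference
# solution, and a model-generated solution, then locates the first
# erroneous step and assigns it an error type.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ERROR_TYPES = [  # hypothetical category names, for illustration only
    "diagram misinterpretation",
    "reasoning error",
    "calculation error",
    "knowledge error",
]

JUDGE_PROMPT = """You are a math grader. Compare the candidate solution
against the reference solution step by step.
Problem: {problem}
Reference solution: {reference}
Candidate solution: {candidate}
Reply in JSON: {{"first_error_step": <int or null>,
"error_type": <one of {types} or null>}}"""

def judge_process(problem: str, reference: str, candidate: str) -> dict:
    """Ask the judge model to locate and classify the first error."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                problem=problem, reference=reference,
                candidate=candidate, types=ERROR_TYPES),
        }],
        response_format={"type": "json_object"},  # force parseable JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```

Aggregating the `error_type` field over all judged solutions would yield the per-category error statistics described above (e.g., the share of errors attributed to diagram misinterpretation).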