Advancing Geometric Problem Solving: A Comprehensive Benchmark for Multimodal Model Evaluation

In this work, we present MM-MATH, a novel benchmark for rigorously evaluating advanced large language and multimodal models, including GPT-4, GPT-4V, and Claude, on geometric computation. The dataset comprises 5,929 carefully constructed geometry problems, each paired with a corresponding image and designed to reflect the difficulty and requirements of ninth-grade mathematics. MM-MATH is motivated by the rapid progress of multimodal technology, which calls for assessment that goes beyond checking final answers to also evaluate reasoning and procedural correctness. Despite impressive gains on various benchmarks, our analysis reveals a persistent weakness: these models often fail to parse and interpret geometric information in images accurately, and such failures account for over 60% of the observed errors. Using a dual evaluation approach that examines both final results and the underlying problem-solving process, we find a marked gap between the capabilities of current multimodal models and human-level proficiency. MM-MATH thus serves as a comprehensive and challenging benchmark for geometric problem solving and exposes critical gaps in the textual and visual comprehension of current models. We hope this work will stimulate further research aimed at closing these gaps and advancing the capabilities of multimodal models.
