Recent advances in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and the breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts, sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experiments, we reveal a notable gap between the performance of current LMMs and that of humans on MATH-V, underscoring the need for further advances in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development. The project is available at https://mathvision-cuhk.github.io