The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted considerable attention, owing to their strong performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within the textual questions, which may help MLLMs deduce answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Human annotators then transform each problem into six distinct versions, each offering a different degree of information content across modalities, yielding about 15K test samples in total. This approach allows MathVerse to comprehensively assess whether, and to what extent, MLLMs can truly understand visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging an answer as True or False, we employ GPT-4(V) to adaptively extract the crucial reasoning steps and then score each step with a detailed error analysis, which reveals the quality of the intermediate CoT reasoning of MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
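As a rough illustration of the two quantitative ideas above, the following Python sketch checks the sample count (2,612 problems in six versions each) and contrasts a binary True/False judgement with a step-wise score. It is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: the `Step` type, the `stepwise_score` function, and the `step_weight` parameter are hypothetical names introduced here, and the per-step verdicts are assumed to come from a judge model such as GPT-4(V).

```python
from dataclasses import dataclass
from typing import List

NUM_PROBLEMS = 2612   # stated in the abstract
NUM_VERSIONS = 6      # stated in the abstract

# 2,612 problems x 6 versions, which the abstract rounds to "15K".
TOTAL_SAMPLES = NUM_PROBLEMS * NUM_VERSIONS


@dataclass
class Step:
    """One extracted reasoning step (hypothetical structure)."""
    text: str
    correct: bool  # assumed verdict from a judge model


def binary_score(final_answer_correct: bool) -> float:
    """Naive True/False judgement on the final answer only."""
    return 1.0 if final_answer_correct else 0.0


def stepwise_score(steps: List[Step], final_answer_correct: bool,
                   step_weight: float = 0.5) -> float:
    """Blend the fraction of correct steps with the final-answer verdict.

    step_weight is an illustrative assumption, not a value from the source.
    """
    if not steps:
        # Without extracted steps, fall back to the binary judgement.
        return binary_score(final_answer_correct)
    step_fraction = sum(s.correct for s in steps) / len(steps)
    return (step_weight * step_fraction
            + (1.0 - step_weight) * binary_score(final_answer_correct))


if __name__ == "__main__":
    steps = [
        Step("Read the side lengths from the diagram", True),
        Step("Apply the Pythagorean theorem", True),
        Step("Compute the square root", False),
    ]
    print(TOTAL_SAMPLES)
    print(binary_score(False), stepwise_score(steps, False))
```

Under these assumptions, an answer with a wrong final result but mostly sound reasoning receives partial credit from the step-wise score, whereas the binary judgement gives it zero.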