MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within textual questions, which potentially assists MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering a different degree of information content across modalities, yielding 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the quality of the intermediate CoT reasoning of MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
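To make the abstract's two quantitative ideas concrete, the following is a minimal Python sketch. It checks that 2,612 problems in six versions each are consistent with the stated "15K" samples, and it contrasts a binary True/False judgment with a step-level score. The `Step` class, the `step_level_score` function, the equal-weight mean over steps, and the `final_weight` parameter are all hypothetical illustrations; the abstract does not specify how step scores are aggregated, so this should not be read as the authors' formula.

```python
from dataclasses import dataclass
from typing import List

NUM_PROBLEMS = 2612        # stated in the abstract
VERSIONS_PER_PROBLEM = 6   # stated in the abstract


def total_samples(problems: int = NUM_PROBLEMS,
                  versions: int = VERSIONS_PER_PROBLEM) -> int:
    """Number of test samples when every problem has every version."""
    return problems * versions


@dataclass
class Step:
    """One extracted reasoning step (hypothetical structure)."""
    text: str
    correct: bool  # assumed to come from a judge model's per-step verdict


def binary_score(final_correct: bool) -> float:
    """Naive True/False judgment of the final answer only."""
    return 1.0 if final_correct else 0.0


def step_level_score(steps: List[Step], final_correct: bool,
                     final_weight: float = 0.5) -> float:
    """Blend the fraction of correct steps with the final-answer verdict.

    The weighting scheme is an assumption made for illustration only.
    """
    if not steps:
        return binary_score(final_correct)
    step_avg = sum(s.correct for s in steps) / len(steps)
    return (1 - final_weight) * step_avg + final_weight * binary_score(final_correct)


if __name__ == "__main__":
    print(total_samples())  # 15672, i.e. roughly 15K
    steps = [Step("read the diagram", True),
             Step("set up the equation", True),
             Step("solve for x", False)]
    # A wrong final answer scores 0 under binary judgment, but the
    # step-level score still credits the two correct intermediate steps.
    print(binary_score(False), step_level_score(steps, False))
```

Under these assumptions, a response with a wrong final answer but mostly sound reasoning is distinguished from one that is wrong throughout, which is the kind of fine-grained signal the abstract describes.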

