MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills across many tasks and domains, but their mathematical reasoning ability in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing model, GPT-4V, achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1 percentage points. Our in-depth analysis reveals that GPT-4V's superiority is mainly attributable to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4 percentage points, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore GPT-4V's new ability of self-verification, its application of self-consistency, and its interactive chatbot capabilities, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
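The headline figures above imply two scores that the abstract does not state directly. The short sketch below derives them, assuming the reported gaps of 15.1 and 10.4 are absolute differences in overall accuracy rather than relative ones; the function and constant names are illustrative and not part of the MathVista codebase. Under that reading, Bard sits at about 34.8% and human performance at about 60.3%, and the benchmark draws on 31 source datasets in total.

```python
# Figures quoted in the abstract (overall accuracy, in percent).
GPT4V_ACCURACY = 49.9
LEAD_OVER_BARD = 15.1   # assumption: absolute gap, in percentage points
GAP_TO_HUMAN = 10.4     # assumption: absolute gap, in percentage points

# Dataset counts quoted in the abstract.
EXISTING_DATASETS = 28
NEW_DATASETS = 3        # IQTest, FunctionQA, PaperQA


def implied_figures():
    """Derive the scores and dataset total implied by the abstract."""
    return {
        # Bard trails GPT-4V by the stated lead.
        "bard_accuracy": round(GPT4V_ACCURACY - LEAD_OVER_BARD, 1),
        # Humans exceed GPT-4V by the stated shortfall.
        "human_accuracy": round(GPT4V_ACCURACY + GAP_TO_HUMAN, 1),
        # Existing plus newly created source datasets.
        "total_datasets": EXISTING_DATASETS + NEW_DATASETS,
    }


if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value}")
```

If the gaps were instead meant as relative differences, the derived scores would change, so these values should be checked against the paper's results tables before being quoted.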
