Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with mathematical problems because of the intricate reasoning they require. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at the high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and to cover a wider range of subject areas than existing benchmarks. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and on closed-source models from the GPT series and the Gemini family. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly on the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.