Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with mathematical problems because of the intricate reasoning they require. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at the high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and to cover a wider range of subject areas than existing benchmarks. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and on closed-source models from the GPT series and the Gemini family. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly on the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.