Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Source: arXiv. Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), July 14-18, 2024, Washington, D.C., USA.

Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often surpassing human performance. Despite these advances, mathematics remains a distinctive challenge, primarily because of its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, we conduct a case analysis of the best-performing LLM, manually evaluating the quality and accuracy of its answers. We find that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions, and that it outperforms the current best approach on ArqMATH3 Task1 in terms of P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, setting the stage for future research and advances in AI-driven mathematical reasoning. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange
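
The abstract reports results as nDCG and P@10. As a rough illustration of what these two ranking metrics measure, the following is a minimal sketch using their standard textbook definitions. The function names, the graded relevance scale (0 to 3), and the relevance threshold of 2 for P@k are assumptions made for this example; the paper's actual evaluation setup may differ in these details.

```python
import math
from typing import Sequence


def precision_at_k(relevances: Sequence[int], k: int = 10, threshold: int = 2) -> float:
    """Fraction of the top-k ranked answers judged relevant.

    `threshold` (an assumption here) turns graded labels into a binary
    relevant / not relevant decision.
    """
    top_k = relevances[:k]
    return sum(1 for r in top_k if r >= threshold) / k


def dcg(relevances: Sequence[int]) -> float:
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(relevances))


def ndcg(relevances: Sequence[int]) -> float:
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments for one query, listed in ranked order.
ranked_grades = [3, 2, 0, 1, 2]
print(precision_at_k(ranked_grades, k=5))
print(ndcg(ranked_grades))
```

In this sketch the ideal ranking is built only from the answers present in the list; an evaluation over a shared judgment pool would instead sort all judged answers for the query.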
