Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, mathematics remains a distinctive challenge, primarily because of its specialized structure and the precision it demands. In this study, we adopt a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from Math Stack Exchange (MSE). Second, we conduct a case analysis of the best-performing LLM, manually evaluating the quality and accuracy of its answers. We find that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1 in terms of P@10. Our case analysis indicates that while GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: \url{https://github.com/gipplab/LLM-Investig-MathStackExchange}
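For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how P@10 and nDCG are commonly computed from graded relevance judgments. It uses the standard textbook definitions with linear gain, which may differ in detail from the evaluation protocol used in this study. The function names, the relevance threshold, and the example judgments are hypothetical and chosen only for illustration; they are not data from the paper.

```python
import math


def precision_at_k(relevances, k=10, threshold=1):
    """Fraction of the top-k ranked results whose graded relevance is >= threshold.

    The threshold used to binarize graded labels is an assumption here.
    """
    top_k = relevances[:k]
    return sum(1 for rel in top_k if rel >= threshold) / k


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking.

    Assumption: the ideal ranking is built from the same judged items.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments (0 = not relevant ... 3 = highly relevant)
# for the top 10 answers returned for a single question.
example = [3, 0, 2, 0, 1, 0, 0, 2, 0, 0]

p_at_10 = precision_at_k(example, k=10)
ndcg_score = ndcg(example)
print(f"P@10 = {p_at_10:.2f}, nDCG = {ndcg_score:.2f}")
```

Per-question scores computed this way are typically averaged over all questions in the test set to obtain a single figure for each system.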