Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there is growing debate over whether these models truly understand and apply mathematical knowledge or merely rely on shortcuts for mathematical reasoning. One essential and frequently observed piece of evidence is that when math questions are slightly altered, LLMs can answer them incorrectly. This motivates us to evaluate the robustness of LLMs' mathematical reasoning capability by testing a wide range of question variations. We introduce the adversarial grade school math (\datasetname) dataset, an extension of GSM8K augmented with various mathematical perturbations. Our experiments on 25 LLMs and 4 prompting techniques show that while LLMs exhibit different levels of math reasoning ability, their performance is far from robust. In particular, even for problems that have been solved correctly in GSM8K, LLMs can make mistakes when new statements are added or the question targets are altered. We also explore whether more robust performance can be achieved by composing existing prompting methods, where we try an iterative method that generates and verifies each intermediate thought based on its reasoning goal and calculation result. Code and data are available at \url{https://github.com/qtli/GSM-Plus}.
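As an illustration only, a minimal Python sketch of this generate-then-verify loop is shown below; the \texttt{llm} helper, the prompts, and the stopping condition are assumptions for exposition, not the paper's exact implementation.
\begin{verbatim}
# A minimal sketch of the iterative generate-and-verify prompting loop
# described above. The `llm` helper, the prompts, and the stopping
# condition are illustrative assumptions, not the paper's implementation.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def solve(question: str, max_steps: int = 10) -> list[str]:
    thoughts: list[str] = []
    for _ in range(max_steps):
        # Generate the next intermediate thought, conditioned on prior steps.
        thought = llm(
            f"Question: {question}\nSteps so far: {thoughts}\n"
            "State the goal of the next step, then carry out the calculation."
        )
        # Verify the thought against its stated reasoning goal.
        goal_ok = llm(f"Does this step achieve its goal (yes/no)?\n{thought}")
        # Verify the calculation result independently.
        calc_ok = llm(f"Redo the arithmetic. Is it correct (yes/no)?\n{thought}")
        if "yes" in goal_ok.lower() and "yes" in calc_ok.lower():
            thoughts.append(thought)  # accept only verified thoughts
            if "final answer" in thought.lower():
                break
        # Otherwise, discard the thought and regenerate on the next iteration.
    return thoughts
\end{verbatim}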