Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several state-of-the-art open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance deteriorates significantly as the number of clauses in a question increases. We hypothesize that this decline occurs because current LLMs cannot perform genuine logical reasoning; instead, they replicate reasoning steps observed in their training data. Adding a single clause that appears relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause does not contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
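To make the template mechanism concrete, the sketch below illustrates one way symbolic instantiation could work: placeholders for names and numerical values are sampled under simple constraints, and the ground-truth answer is computed from the sampled values rather than fixed in the text. This is a minimal illustration under assumed conventions, not the paper's actual pipeline; the template string, the `instantiate` function, and the sampling ranges are all hypothetical.

```python
import random

# Hypothetical symbolic template in the spirit of GSM-Symbolic:
# {name}, {x}, {y} are placeholders filled in per instantiation.
TEMPLATE = (
    "{name} picked {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples did {name} pick in total?"
)

NAMES = ["Sophie", "Liam", "Ava"]

def instantiate(seed: int) -> tuple[str, int]:
    """Sample one concrete question/answer pair from the symbolic template."""
    rng = random.Random(seed)
    # Constraint: keep values positive integers so the question stays well-posed.
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, x=x, y=y)
    # The ground truth follows symbolically from the sampled values,
    # so every instantiation has a verifiable answer.
    answer = x + y
    return question, answer

if __name__ == "__main__":
    for seed in range(3):
        q, a = instantiate(seed)
        print(f"Q: {q}\nA: {a}\n")
```

Because each template yields many instances that differ only in surface values, comparing a model's accuracy across instantiations isolates sensitivity to numerical changes from genuine reasoning ability.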