Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present MathGAP, a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications of their arithmetic proof structure, enabling systematic studies of easy-to-hard generalization with respect to proof-tree complexity. Using MathGAP, we find that LLM performance decreases significantly as proofs grow deeper and wider. This effect is more pronounced for complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
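The idea of generating word problems from a proof-structure specification can be illustrated with a minimal sketch. This is not the MathGAP implementation; the entity names, the apples domain, and the single sum-only inference rule are assumptions chosen for illustration — the point is only that depth (and, by allowing more children per node, width) of the proof tree is a controllable parameter:

```python
import random

def gen_proof_tree(depth, rng):
    """Recursively build a toy proof tree: leaves hold known quantities,
    each internal node is a 'combine' inference summing its two children."""
    if depth == 0:
        return {"value": rng.randint(1, 9), "children": []}
    children = [gen_proof_tree(depth - 1, rng) for _ in range(2)]
    return {"value": sum(c["value"] for c in children), "children": children}

def leaves(node):
    """Collect leaf nodes left to right; these become the premise sentences."""
    if not node["children"]:
        return [node]
    return [l for c in node["children"] for l in leaves(c)]

def render_problem(tree, names):
    """Verbalize each leaf as a premise and ask for the root quantity."""
    premises = [f"{name} has {leaf['value']} apples."
                for name, leaf in zip(names, leaves(tree))]
    question = "How many apples do they have in total?"
    return " ".join(premises + [question]), tree["value"]

rng = random.Random(0)
tree = gen_proof_tree(2, rng)  # depth-2 proof tree -> 4 leaf premises
problem, answer = render_problem(tree, ["Alice", "Bob", "Carol", "Dan"])
```

Because the tree is built before it is verbalized, the same structure also yields a ground-truth chain-of-thought trace (one line per internal node), and premise order can be permuted independently of the proof — the two manipulations the abstract reports models being sensitive to.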