We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems chained together so that the answer to the second problem depends on correctly answering the first. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not caused by test-set leakage, but by distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.