Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, rather than true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k), a new benchmark designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of solution steps, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r^2 = 0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
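The Spearman rank correlation used in the analysis above can be sketched as follows: rank both variables (here, a model's probability of generating GSM8k examples and its GSM8k-to-GSM1k accuracy gap), then compute the Pearson correlation of the ranks. This is a minimal illustrative sketch with made-up data, not the paper's actual analysis code.

```python
def average_ranks(values):
    """Assign ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the run of tied values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Standard Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's r = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-model numbers, for illustration only:
# probability of regenerating a GSM8k example, and GSM8k-minus-GSM1k gap.
memorization_prob = [0.01, 0.05, 0.10, 0.20, 0.40]
accuracy_gap = [0.00, 0.02, 0.01, 0.08, 0.13]

r = spearman(memorization_prob, accuracy_gap)
print(f"Spearman r = {r:.2f}, r^2 = {r * r:.2f}")
```

A positive r^2 under this kind of analysis is consistent with, but does not prove, partial memorization; the paper reports r^2 = 0.32 across the models it evaluates.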