Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, rather than true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, and answer magnitude. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis finds a positive relationship (Spearman's r^2=0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
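The relationship reported above is a rank correlation between two per-model quantities: how likely a model is to generate GSM8k examples, and how much its accuracy falls from GSM8k to GSM1k. The following is a minimal sketch of how such a statistic could be computed, not the authors' analysis code. The function names, the `toy_models` table, and every number in it are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model measurements (made up, NOT from the paper):
# (mean log-likelihood on GSM8k, accuracy on GSM8k, accuracy on GSM1k)
toy_models = {
    "model_a": (-1.9, 0.80, 0.70),
    "model_b": (-2.4, 0.75, 0.73),
    "model_c": (-2.1, 0.62, 0.55),
    "model_d": (-2.8, 0.90, 0.91),
    "model_e": (-2.6, 0.58, 0.54),
}

log_likelihoods = [ll for ll, _, _ in toy_models.values()]
# Performance gap: positive when a model does better on GSM8k than on GSM1k.
gaps = [acc_8k - acc_1k for _, acc_8k, acc_1k in toy_models.values()]

rho = spearman(log_likelihoods, gaps)
print(f"Spearman rho = {rho:.3f}, rho^2 = {rho ** 2:.3f}")
```

A rank-based statistic is a natural fit here because log-likelihoods and accuracy gaps are on different scales and need not be linearly related; only the ordering of the models matters.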