In this paper, we investigate the underlying factors that may enhance the mathematical reasoning capabilities of large language models (LLMs). We argue that the data scaling law for math reasoning in modern LLMs is far from saturated: model quality continues to improve as data quantity increases. To support this claim, we introduce the Skywork-Math model series, obtained by supervised fine-tuning (SFT) of common 7B LLMs on our proposed 2.5M-instance Skywork-MathQA dataset. Skywork-Math 7B achieves accuracies of 51.2% on the competition-level MATH benchmark and 83.9% on the GSM8K benchmark using only SFT data, outperforming an early version of GPT-4 on MATH. We attribute the superior performance of the Skywork-Math models to our novel two-stage data synthesis and model SFT pipelines, which include three different augmentation methods and a diverse seed problem set, ensuring both the quantity and quality of the Skywork-MathQA dataset across varying difficulty levels. Most importantly, we provide several practical takeaways for enhancing math reasoning abilities in LLMs for both research and industry applications.