Mathematical reasoning continues to be a critical challenge in large language model (LLM) development and attracts significant interest. However, most cutting-edge progress in mathematical reasoning with LLMs has become \emph{closed-source} due to lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality supervised finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released \texttt{Llama3.1} family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance; (b) data generated by a strong teacher outperforms \emph{on-policy} data generated by a weak student model; (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering; and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs ($\approx$ 600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning \texttt{Llama-3.1-8B-Base} with OpenMathInstruct-2 outperforms \texttt{Llama3.1-8B-Instruct} on MATH by an absolute 15.9\% (51.9\% $\rightarrow$ 67.8\%). Finally, to accelerate open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.