Mathematical reasoning continues to be a critical challenge in large language model (LLM) development and attracts significant interest. However, most cutting-edge progress in mathematical reasoning with LLMs has become \emph{closed-source} due to lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality supervised finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released \texttt{Llama3.1} family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance; (b) data generated by a strong teacher outperforms \emph{on-policy} data generated by a weak student model; (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering; and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs ($\approx$ 600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning \texttt{Llama-3.1-8B-Base} with OpenMathInstruct-2 outperforms \texttt{Llama3.1-8B-Instruct} on MATH by an absolute 15.9\% (51.9\% $\rightarrow$ 67.8\%). Finally, to accelerate open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.