Mathematical reasoning remains challenging for large language models (LLMs), prompting the development of math-specific LLMs such as LLEMMA, DeepSeekMath, and Qwen2-Math. These models typically follow a two-stage training paradigm: pre-training on math-related corpora, followed by post-training with problem datasets for supervised fine-tuning (SFT). Despite these efforts, the gains in mathematical reasoning from continued pre-training (CPT) are often less significant than those obtained via SFT. This study addresses that discrepancy by exploring alternative strategies in the pre-training phase, focusing on problem-solving data rather than general mathematical corpora. We investigate three research questions: (1) Can problem-solving data enhance the model's mathematical reasoning capabilities more effectively than general mathematical corpora during CPT? (2) Are synthetic data from the same source equally effective, and which synthesis methods are most efficient? (3) How do the capabilities developed from the same problem-solving data differ between the CPT and SFT stages, and what factors explain these differences? Our findings show that problem-solving data enhances the model's mathematical capabilities significantly more than general mathematical corpora. We also identify effective data synthesis methods, demonstrating that the tutorship amplification synthesis method achieves the best performance. Furthermore, while SFT facilitates instruction-following abilities, it underperforms CPT on the same data, which can be partially attributed to its poor capacity for learning from more challenging problem-solving data. These insights offer valuable guidance for optimizing the mathematical reasoning capabilities of LLMs and culminate in our development of a powerful mathematical base model, MathGPT-8B.