Recent progress of large reasoning models on challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often use CoT data indiscriminately, leaving open the critical question of which data most effectively enhance a model's reasoning capabilities. In this paper, we define, for the first time, a foundation model's reasoning potential as the inverse of the number of independent attempts it needs to answer a question correctly, a quantity strongly correlated with final model performance. We then propose expanding this reasoning potential with diverse data enriched in high-value reasoning patterns. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by their commonality and inductive capability, and use them to construct a core reference set enriched with valuable reasoning patterns. We further propose a dual-granularity algorithm, operating on chains of reasoning patterns and on token entropy, that efficiently selects from the data pool the high-value CoT data (CoTP) best aligned with the core set, thereby training models to reason effectively. With only 10B tokens of CoTP data, an 85A6B Mixture-of-Experts (MoE) model improves by 9.58% on the challenging AIME 2024 and 2025 benchmarks and raises the upper bound of its downstream RL performance by 7.81%.
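The definition of reasoning potential admits a simple formalization; the notation below (question $q$, attempt count $k(q)$, per-attempt success probability $p_q$) is ours, introduced for illustration rather than taken from the paper:

$$
\mathrm{Potential}(q) \;=\; \frac{1}{k(q)},
$$

where $k(q)$ is the number of independent attempts the foundation model needs before answering $q$ correctly. Under i.i.d. sampling with per-attempt success probability $p_q$, $k(q)$ is geometrically distributed with $\mathbb{E}[k(q)] = 1/p_q$, so in expectation the potential tracks the model's per-attempt pass rate on $q$.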
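To make the dual-granularity selection concrete, here is a minimal sketch under our own assumptions: the coarse score is the overlap between a candidate's chain of atomic reasoning patterns and the core reference set, the fine score is the mean token entropy of the CoT, and the two are combined by a weighted sum. Every name here (`pattern_chain_overlap`, `mean_token_entropy`, `select_cotp`, `alpha`) is hypothetical; this is not the paper's released implementation.

```python
import math
from typing import List, Sequence

def pattern_chain_overlap(chain: Sequence[str], core_chains: List[Sequence[str]]) -> float:
    """Coarse granularity: best Jaccard overlap between the candidate's
    reasoning-pattern chain and any chain in the core reference set."""
    cand = set(chain)
    best = 0.0
    for ref in core_chains:
        ref_set = set(ref)
        union = cand | ref_set
        if union:
            best = max(best, len(cand & ref_set) / len(union))
    return best

def mean_token_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Fine granularity: average Shannon entropy of the model's
    next-token distributions over the CoT sequence."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_probs]
    return sum(entropies) / max(len(entropies), 1)

def select_cotp(candidates, core_chains, alpha=0.5, top_k=2):
    """Rank candidates by a weighted combination of pattern alignment
    and token entropy, then keep the top_k as the CoTP subset."""
    scored = []
    for chain, token_probs, text in candidates:
        score = (alpha * pattern_chain_overlap(chain, core_chains)
                 + (1 - alpha) * mean_token_entropy(token_probs))
        scored.append((score, text))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [text for _, text in scored[:top_k]]

if __name__ == "__main__":
    core = [["decompose", "case_split", "verify"],
            ["substitute", "simplify", "verify"]]
    cands = [
        (["decompose", "case_split", "verify"], [[0.5, 0.5], [0.9, 0.1]], "cot_A"),
        (["guess", "answer"], [[1.0], [1.0]], "cot_B"),
    ]
    print(select_cotp(cands, core))  # cot_A outranks cot_B
```

In practice the pattern chains would come from the abstraction pipeline and the entropies from the model's logits; the weighted sum is just one plausible way to combine the two granularities into a single selection score.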