Reinforcement learning (RL) has rapidly become a key component in training reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study the processes through which capabilities are acquired or enhanced during RL post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of supervised fine-tuning (SFT) data and RL data in activating these mechanisms: supervising the model on diverse reasoning strategies can enable strategy selection, and increasing the difficulty of the RL data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions for continuing to scale reasoning capabilities.