Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists over whether RL genuinely instills new skills in a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this question, we present an explicit comparison between distribution sharpening and task-reward-based learning, using RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why its optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments with Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating a task-based reward signal can substantially help achieve robust performance improvements and stable learning.