On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which obtains only sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information, such as reference answers, to enable self-evolution. This paper demonstrates that learning signals derived solely from the privileged teacher lead to severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose \textbf{RLSD} (\textbf{RL}VR with \textbf{S}elf-\textbf{D}istillation). Specifically, we use self-distillation to obtain token-level policy differences that determine fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to harness the strengths of both RLVR and OPSD simultaneously, achieving a higher convergence ceiling and superior training stability.
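To make the division of labor concrete, the following is a minimal sketch of how a token-level magnitude from self-distillation could be combined with a sequence-level direction from RLVR. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function name `rlsd_token_advantages`, the use of the absolute log-probability gap between the privileged teacher and the student as the magnitude, and the normalization to mean 1 are all hypothetical choices, not the paper's definition.

```python
from typing import List


def rlsd_token_advantages(
    student_logp: List[float],
    teacher_logp: List[float],
    seq_advantage: float,
    eps: float = 1e-8,
) -> List[float]:
    """Hypothetical sketch: per-token advantages for one sampled response.

    student_logp:  log-probs of the sampled tokens under the student
                   (same model, no privileged information).
    teacher_logp:  log-probs of the same tokens under the teacher
                   (same model, conditioned on privileged information).
    seq_advantage: scalar advantage from the verifiable reward (RLVR),
                   e.g. positive for a correct response, negative otherwise.
    """
    if len(student_logp) != len(teacher_logp):
        raise ValueError("student and teacher log-probs must align token by token")
    if not student_logp:
        return []

    # Magnitude (assumed form): absolute teacher/student log-prob gap per token.
    gaps = [abs(t - s) for s, t in zip(student_logp, teacher_logp)]
    mean_gap = sum(gaps) / len(gaps)

    # Normalize weights to mean 1 so the total update scale stays comparable
    # to plain RLVR; fall back to uniform weights if the policies agree.
    if mean_gap < eps:
        weights = [1.0] * len(gaps)
    else:
        weights = [g / mean_gap for g in gaps]

    # Direction comes only from the environment: every token inherits the
    # sign of seq_advantage, and the weights only rescale its magnitude.
    return [seq_advantage * w for w in weights]
```

Under these assumptions, the weights are non-negative, so the teacher can never flip the sign of an update; it can only concentrate the update on tokens where the privileged and unprivileged policies disagree most.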