Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
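The self-distillation step described above can be sketched as a cross-entropy between two distributions from the same model: a teacher pass conditioned on the textual feedback, and a student pass without it. The sketch below is illustrative only, assuming hypothetical `student_logits`/`teacher_logits` arrays per rollout token; it is not the paper's implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sdpo_distill_loss(student_logits, teacher_logits):
    """Cross-entropy from the feedback-conditioned self-teacher to the policy.

    Both logit arrays have shape (num_tokens, vocab_size) and come from the
    SAME model; teacher_logits are produced with the feedback in context.
    In a real trainer the teacher distribution would be stop-gradiented, so
    gradients flow only through the student (policy) logits.
    """
    p_teacher = softmax(teacher_logits)                  # dense per-token targets
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())
```

Because the teacher is the current policy itself (merely given extra context), this yields a dense per-token signal from a single scalar-reward environment interaction, with no external teacher model.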