Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning with verifiable rewards (RLVR) relies on binary rewards, which are broadly applicable and powerful but provide only sparse supervision during training. Distillation provides dense token-level supervision, but that supervision typically comes from an external teacher or from high-quality demonstrations, which can be costly to collect or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more sample-efficient in training than RL and requires neither an external teacher nor high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the Reviser into the Generator, using as supervision the Reviser's token distributions conditioned on the Generator's response and its reward. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies reveal two novel characteristics of the proposed algorithm: (a) token-level self-localization, where the Reviser uses the reward to identify the key tokens in the Generator's response that need to be revised, and (b) iterative self-evolution, where the model's improving ability to revise answers can be distilled back into generation performance through regular teacher synchronization. Code: https://github.com/princeton-pli/Self-Distillation-Zero.
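
To make the idea of dense token-level self-supervision concrete, the following is a minimal toy sketch, not the released implementation. It assumes the supervision is a per-token KL divergence from the Reviser's distribution to the Generator's distribution, evaluated along a response sampled from the Generator. The abstract does not specify the divergence or its direction, so that choice is an assumption. The function names (`softmax`, `token_kl`, `dense_signal`) and the hand-written logits are hypothetical. In the actual method both sets of distributions would come from the same model, with the Reviser additionally conditioned on the response and its binary reward.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed loss form)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dense_signal(reviser_logits, generator_logits):
    """Per-token divergences along one Generator-sampled response.

    Each argument is a list with one logit vector per response position.
    """
    return [token_kl(t, s) for t, s in zip(reviser_logits, generator_logits)]


# Toy response of three tokens over a three-word vocabulary (made-up numbers).
# The Reviser agrees with the Generator at positions 0 and 2 and disagrees
# at position 1.
reviser_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 2.0]]
generator_logits = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

signal = dense_signal(reviser_logits, generator_logits)
print([round(s, 3) for s in signal])
```

In this toy case the signal is zero where the two roles agree and positive only at position 1, which illustrates how a single binary reward on the whole response could be turned into a position-specific training signal, in the spirit of the token-level self-localization described above.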
