Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of privileged information, the ground-truth answer, is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, so no additional rollouts need to be sampled. Because the teacher sees a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the position where a failed rollout diverges from its successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD achieves the best results compared with GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.
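To make the mechanism concrete, the following is a minimal sketch of how a peer-conditioned per-token signal might be computed for one failed rollout. It is an illustration under stated assumptions, not the method as implemented: the helper names (`pick_successful_peer`, `first_divergence`, `per_token_signal`), the success threshold, and the use of a simple teacher-minus-student log-probability difference are all hypothetical, and the toy log-probabilities are made up for demonstration.

```python
from typing import List, Optional, Sequence


def pick_successful_peer(
    rewards: Sequence[float], exclude: int, threshold: float = 1.0
) -> Optional[int]:
    """Return the index of a successful rollout in the group other than
    `exclude`, or None if the group has no such rollout.
    Assumption: success means reward >= threshold."""
    for i, r in enumerate(rewards):
        if i != exclude and r >= threshold:
            return i
    return None


def first_divergence(failed: Sequence[int], peer: Sequence[int]) -> int:
    """Index of the first token at which the two rollouts differ.
    If one is a prefix of the other, return the shorter length."""
    for t, (a, b) in enumerate(zip(failed, peer)):
        if a != b:
            return t
    return min(len(failed), len(peer))


def per_token_signal(
    student_logp: Sequence[float], teacher_logp: Sequence[float]
) -> List[float]:
    """Assumed signal: log-prob of each sampled token under the teacher
    (same model, conditioned on the successful peer) minus its log-prob
    under the student (conditioned on the prompt only)."""
    if len(student_logp) != len(teacher_logp):
        raise ValueError("log-prob sequences must have equal length")
    return [t - s for s, t in zip(student_logp, teacher_logp)]


# Toy group of four rollouts with binary verifiable rewards.
rewards = [0.0, 1.0, 0.0, 1.0]
failed_idx = 0
peer_idx = pick_successful_peer(rewards, exclude=failed_idx)

# Toy token ids for the failed rollout and its successful peer.
failed_tokens = [5, 7, 9, 2]
peer_tokens = [5, 7, 4, 8, 1]
divergence = first_divergence(failed_tokens, peer_tokens)

# Made-up log-probs of the failed rollout's tokens under student and teacher.
student_logp = [-0.5, -0.25, -0.25, -0.5]
teacher_logp = [-0.5, -0.25, -2.25, -0.625]
signal = per_token_signal(student_logp, teacher_logp)

# Position where the signal has the largest magnitude.
peak = max(range(len(signal)), key=lambda t: abs(signal[t]))
```

In this toy example the teacher agrees with the student on the shared prefix and sharply down-weights the token at which the failed rollout departs from the peer, so the largest-magnitude signal coincides with the divergence position. In practice the log-probabilities would come from two forward passes of the same model with different conditioning, and the exact form of the distillation objective is not specified here.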