Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.
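To make the distribution-matching objective concrete, the following minimal sketch computes a per-token divergence between the student (question only) and the self-teacher (question plus context) over a sampled response. It is an illustration under stated assumptions, not the paper's implementation: the divergence direction KL(teacher || student), the function names `softmax` and `per_token_kl`, and the toy logits are all hypothetical choices made here for exposition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(student_logits, teacher_logits):
    """Return KL(teacher || student) at each response position.

    Assumption: both inputs are lists of per-position logit rows scored on
    the SAME sampled response; the student saw only the question, the
    self-teacher also saw the context (e.g. a critique).
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s = softmax(s_row)
        p_t = softmax(t_row)
        kls.append(sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0))
    return kls


# Toy data (hypothetical): 3 response tokens, vocabulary of size 4.
# The teacher agrees with the student at positions 0 and 2 and
# disagrees only at position 1.
student = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
teacher = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

kl = per_token_kl(student, teacher)
loss = sum(kl) / len(kl)  # mean over positions, one possible reduction
```

In this toy case the divergence is zero wherever the two distributions coincide and positive only at the position where the context changes the teacher's prediction. This mirrors, in miniature, the abstract's contrast between feedback that targets only the failing tokens and context that shifts the distribution at every token.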