On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned on-policy self-distillation (OPSD) under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
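The polarity validation and gating steps described above can be illustrated with a minimal sketch. It is not the released SGSD implementation: it assumes that a teacher's stance is read from the sign of its summed token log-probability gap to the student, and that the gate is a simple band-pass on the absolute per-token gap. The function names and the thresholds `low` and `high` are hypothetical.

```python
from typing import List


def teacher_polarity(gaps: List[float], success: bool) -> float:
    """Return +1.0 if the teacher's stance agrees with the verifier, else -1.0.

    Assumption: a teacher "supports" the rollout when its summed log-prob gap
    (teacher minus student) is positive, and "suppresses" it otherwise.
    """
    supports = sum(gaps) > 0.0
    return 1.0 if supports == success else -1.0


def gate(gap: float, low: float = 0.1, high: float = 5.0) -> float:
    """Hypothetical band-pass gate: keep a token only when the disagreement is
    neither too small (uncertain) nor too large (extreme)."""
    return 1.0 if low <= abs(gap) <= high else 0.0


def token_signals(
    teacher_logps: List[List[float]],
    student_logps: List[float],
    success: bool,
    low: float = 0.1,
    high: float = 5.0,
) -> List[float]:
    """Average the gated, polarity-signed gaps over all teachers, per token."""
    if not teacher_logps:
        raise ValueError("teacher pool is empty")
    totals = [0.0] * len(student_logps)
    for t_logps in teacher_logps:
        # Every teacher scores the same student rollout, token by token.
        gaps = [t - s for t, s in zip(t_logps, student_logps)]
        pol = teacher_polarity(gaps, success)
        for i, g in enumerate(gaps):
            totals[i] += pol * gate(g, low, high) * g
    return [x / len(teacher_logps) for x in totals]


# Toy example: two teachers score one three-token rollout.
student = [-1.0, -2.0, -0.5]
teacher_a = [-0.5, -1.0, -0.5]   # rates the rollout higher than the student
teacher_b = [-2.0, -9.0, -0.55]  # rates the rollout lower than the student
signals_on_success = token_signals([teacher_a, teacher_b], student, success=True)
signals_on_failure = token_signals([teacher_a, teacher_b], student, success=False)
```

In this toy setup, a verified success keeps teacher A's gaps as they are and flips teacher B's, while a verified failure does the opposite. The second token of teacher B is dropped by the gate as extreme, and the third token of both teachers is dropped as uncertain.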