Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in practice. On-Policy Distillation (OPD) is a natural recipe for transferring such capabilities to smaller students, but we find that it suffers from a characteristic failure mode in this setting: small student errors compound across turns and push the trajectory out of the teacher's familiar state distribution, so the teacher's supervision becomes least reliable precisely where the student needs it most. We propose Guided On-Policy Distillation (Guided-OPD), a simple yet effective algorithm that mixes teacher- and student-generated turns within each rollout and schedules the teacher's intervention probability along a curriculum that decays to zero. Strong guidance keeps early trajectories close to the teacher distribution and is then gradually withdrawn to recover the purely on-policy regime used at inference. On ALFWorld, ScienceWorld, and WebShop, distilling Qwen3 students from a Qwen3-30B-A3B teacher, Guided-OPD improves Score by 21.1% and Success Rate by 25.5% over vanilla OPD on average, with larger gains on smaller students.
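The turn-level mixing and decaying intervention schedule described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the linear shape of the curriculum, the parameters `p0` and `decay_frac`, the per-turn Bernoulli draw, and the callables `student_act`, `teacher_act`, and `env_step` are all hypothetical. The abstract only states that teacher and student turns are mixed within a rollout and that the teacher's intervention probability decays to zero.

```python
import random
from typing import Callable, List, Tuple

History = List[str]


def teacher_prob(step: int, total_steps: int,
                 p0: float = 0.8, decay_frac: float = 0.5) -> float:
    """Hypothetical linear curriculum for the teacher intervention probability.

    Starts at p0 and reaches 0 after decay_frac * total_steps training steps,
    after which rollouts are purely on-policy (student only).
    """
    horizon = max(1, int(decay_frac * total_steps))
    return max(0.0, p0 * (1.0 - step / horizon))


def mixed_rollout(student_act: Callable[[History], str],
                  teacher_act: Callable[[History], str],
                  env_step: Callable[[str], Tuple[str, bool]],
                  init_obs: str,
                  p_teacher: float,
                  max_turns: int,
                  rng: random.Random) -> List[Tuple[str, str]]:
    """Generate one rollout, choosing the acting model independently per turn.

    Returns a list of (source, action) pairs, where source is "teacher" or
    "student". Assumed interface: env_step(action) -> (observation, done).
    """
    history: History = [init_obs]
    turns: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        # Assumption: one Bernoulli draw per turn decides who acts.
        use_teacher = rng.random() < p_teacher
        actor = teacher_act if use_teacher else student_act
        action = actor(history)
        obs, done = env_step(action)
        turns.append(("teacher" if use_teacher else "student", action))
        history += [action, obs]
        if done:
            break
    return turns
```

With `p_teacher = 0.0` this reduces to an ordinary student rollout, which matches the purely on-policy regime the abstract says is recovered at the end of the curriculum. The distillation loss itself, and which turns receive teacher supervision, are deliberately left out because the abstract does not specify them.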