Knowledge distillation offers a promising path to transferring reasoning capabilities from large teacher models to efficient student models. However, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration, prevents effective use of interactive environment feedback, and causes severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching based on discrete verbal scores (0--9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models via verbal feedback, and it avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to a +12.9% absolute improvement in average EM on Web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io.
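To make the trajectory-matching idea concrete, the following is a minimal sketch of how a discrete teacher verbal score could drive a trajectory-level training signal. It assumes a REINFORCE-style objective in which the normalized score weights the student's trajectory log-likelihood; the `teacher_verbal_score` function is a hypothetical stand-in for the teacher model, and the paper's exact objective may differ.

```python
import math

def teacher_verbal_score(trajectory: str) -> int:
    """Hypothetical stand-in for the teacher model: returns a discrete
    verbal score in {0, ..., 9} grading the whole student trajectory."""
    return 7  # placeholder; a real teacher would judge the text

def ovd_loss(trajectory_logprobs: list[float], score: int) -> float:
    # Normalize the discrete verbal score (0-9) to a reward in [0, 1].
    reward = score / 9.0
    # Trajectory-level objective (sketch): scale the student's summed
    # token log-probabilities by the teacher's trajectory reward, so no
    # token-level alignment with the teacher is needed.
    logprob = sum(trajectory_logprobs)
    return -reward * logprob

# Example: per-token log-probs of one sampled student trajectory.
logps = [math.log(0.5), math.log(0.8), math.log(0.9)]
score = teacher_verbal_score("sampled trajectory text")
loss = ovd_loss(logps, score)
```

A higher verbal score yields a larger weight on the trajectory's log-likelihood, pushing the student toward outputs the teacher rates favorably without any per-token probability matching.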