Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, which typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines: successful behavior is observable, but the reasoning process behind it is not. To address this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously within the same model. We also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains with reinforcement learning (RL) under a reverse KL penalty between the student and the PI-conditioned teacher. We show that both algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill, and in some cases OPSD, outperforms the industry-standard practice (supervised fine-tuning followed by RL), which assumes access to full chain-of-thought supervision, across multiple agentic benchmarks, models, and forms of PI. We complement these results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and identifying when OPSD is competitive.
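As a rough sketch (the notation here is assumed for illustration and is not taken from the abstract), the OPSD training signal described above might be written as an RL objective with a reverse-KL penalty pulling the student toward the PI-conditioned teacher:

```latex
% Hypothetical notation: x is the task input, z the privileged information,
% \tau an action trajectory, r(\tau) the task reward, and \beta the penalty weight.
% Both teacher and student share parameters \theta; the teacher is simply
% the same model conditioned on z.
\mathcal{J}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid x)} \left[ r(\tau) \right]
  - \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_\theta(\cdot \mid x, z) \right)
```

The reverse direction of the KL (student first, teacher second) penalizes the student for placing mass on actions the PI-conditioned teacher would not take, while the on-policy expectation keeps the gradient grounded in trajectories the student itself generates.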