On-policy distillation (\textsc{OPD}) has recently become a prominent post-training recipe by combining two desirable ingredients: on-policy student trajectories and dense teacher supervision. However, how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and \textsc{OPD} use cases, our analysis yields two main findings. On sparsity, \textsc{OPD} updates are small and coordinate-sparse. They are distributed across layers, with the largest relative movement usually appearing in FFN modules. This sparse structure is operationally useful: training only the discovered subnetwork nearly recovers full-training performance. The sparse support does not remove the need for adaptive optimization: SGD, previously reported to be competitive in \textsc{RLVR}, underperforms AdamW in our \textsc{OPD} optimizer ablation, suggesting that dense teacher supervision preserves useful momentum structure and heterogeneous second-moment scales. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn \textsc{OPD} into ordinary dense parameter rewriting; instead, \textsc{OPD} retains important geometric signatures of on-policy post-training.
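One way to make the sparsity and geometry diagnostics above concrete is sketched below. The notation ($W_0$, $W_1$, $\Delta W$, $\tau$, $k$, $r$) and the specific definitions are illustrative assumptions, not taken from the source, which may define these quantities differently. Let $W_0 \in \mathbb{R}^{m \times n}$ be a source weight matrix, $W_1$ the same matrix after \textsc{OPD}, and $\Delta W = W_1 - W_0$. Coordinate sparsity at an assumed tolerance $\tau \ge 0$ can be measured as the fraction of entries that are effectively unchanged:
\[
s_\tau(\Delta W) = \frac{1}{mn}\,\bigl|\{(i,j) : |\Delta W_{ij}| \le \tau\}\bigr| .
\]
Spectral concentration can be summarized by the share of update energy carried by the top $r$ singular values $\sigma_1(\Delta W) \ge \sigma_2(\Delta W) \ge \dots$ of the update:
\[
c_r(\Delta W) = \frac{\sum_{i \le r} \sigma_i(\Delta W)^2}{\sum_{i \le \min(m,n)} \sigma_i(\Delta W)^2} .
\]
With the singular value decomposition $W_0 = U \Sigma V^\top$, and $U_k$, $V_k$ denoting the top-$k$ left and right singular vectors of $W_0$, the share of update energy inside the principal subspaces of the source weights is
\[
\rho_k(\Delta W) = \frac{\lVert U_k^\top \, \Delta W \, V_k \rVert_F^2}{\lVert \Delta W \rVert_F^2} \in [0, 1] .
\]
Under these assumed definitions, an update that is numerically full-rank but spectrally concentrated has all $\sigma_i(\Delta W) > 0$ while $c_r$ is close to $1$ for small $r$, and an update that lies mostly away from the principal singular subspaces has $\rho_k$ close to $0$.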