Knowledge distillation is commonly used to compress neural networks, reducing their inference cost and memory footprint. However, current distillation methods for auto-regressive models, such as generative language models (LMs), suffer from two key issues: (1) distribution mismatch between the output sequences seen during training and the sequences the student generates at deployment, and (2) model under-specification, where the student model may not be expressive enough to fit the teacher's distribution. To address these issues, we propose Generalized Knowledge Distillation (GKD). GKD mitigates distribution mismatch by sampling output sequences from the student during training. Furthermore, GKD handles model under-specification by optimizing alternative divergences, such as reverse KL, that focus on generating samples from the student that are likely under the teacher's distribution. We demonstrate that GKD outperforms commonly used approaches for distilling LLMs on summarization, machine translation, and arithmetic reasoning tasks.
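The two ingredients described above, training on sequences sampled from the student and scoring them with a divergence such as reverse KL, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-token vocabulary, the `student_data_fraction` mixing parameter, and the representation of a model as a function from a token prefix to next-token logits are all assumptions made for illustration. The sketch only evaluates the loss; it does not perform a gradient update.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): penalizes student mass the teacher finds unlikely."""
    return kl(student, teacher)


def sample_sequence(next_token_logits, length, rng):
    """Sample tokens autoregressively from a model given as prefix -> logits."""
    seq = []
    for _ in range(length):
        probs = softmax(next_token_logits(seq))
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq


def distill_loss(student_logits, teacher_logits, fixed_seq, length, rng,
                 student_data_fraction=1.0, divergence=reverse_kl):
    """Average token-level divergence along one output sequence.

    With probability `student_data_fraction` (a hypothetical knob) the sequence
    is sampled from the student; otherwise the fixed sequence is used.
    """
    on_policy = rng.random() < student_data_fraction
    seq = sample_sequence(student_logits, length, rng) if on_policy else fixed_seq[:length]
    total = 0.0
    for t in range(len(seq)):
        prefix = seq[:t]
        teacher_probs = softmax(teacher_logits(prefix))
        student_probs = softmax(student_logits(prefix))
        total += divergence(teacher_probs, student_probs)
    return total / len(seq)


# Toy models (hypothetical): logits depend only on the previous token.
def toy_teacher(prefix):
    last = prefix[-1] if prefix else 0
    return [2.0 - last, 0.5, -1.0 + last]


def toy_student(prefix):
    return [0.0, 0.0, 0.0]  # a uniform student that ignores the prefix


rng = random.Random(0)
on_policy_loss = distill_loss(toy_student, toy_teacher, [0, 1, 2], 3, rng)
off_policy_loss = distill_loss(toy_student, toy_teacher, [0, 1, 2], 3, rng,
                               student_data_fraction=0.0, divergence=forward_kl)
print(on_policy_loss, off_policy_loss)
```

Setting `student_data_fraction=0.0` with `divergence=forward_kl` corresponds to scoring a fixed sequence with forward KL, while the defaults score student samples with reverse KL. In a real setup the two models would be neural networks and the loss would be differentiated with respect to the student's parameters.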