Knowledge distillation (KD) is widely used to compress a teacher model by training a smaller student model, reducing inference cost and memory footprint. However, current KD methods for auto-regressive sequence models suffer from a distribution mismatch between the output sequences seen during training and those the student generates at inference time. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of relying solely on a fixed set of output sequences, GKD trains the student on its own self-generated output sequences, using the teacher's feedback on those sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, as well as for task-agnostic distillation for instruction tuning.
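To make the on-policy idea concrete, the following is a minimal sketch, not the authors' implementation. It samples a continuation from the student, then scores every generated position with a token-level divergence between the teacher's and the student's next-token distributions. The abstract does not name the alternative losses, so forward and reverse KL are used here purely as illustrative assumptions. The toy models, the vocabulary size, and all function names (`toy_teacher`, `toy_student`, `on_policy_distill_loss`, and so on) are hypothetical.

```python
import math
import random

VOCAB = 4  # hypothetical toy vocabulary size


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def token_divergence(teacher_probs, student_probs, kind="forward_kl"):
    """Illustrative choice of loss; the two options are assumptions."""
    if kind == "forward_kl":
        return kl(teacher_probs, student_probs)
    if kind == "reverse_kl":
        return kl(student_probs, teacher_probs)
    raise ValueError(f"unknown divergence: {kind}")


def sample_from_student(student_logits_fn, prompt, length, rng):
    """Auto-regressively sample `length` tokens from the student."""
    seq = list(prompt)
    for _ in range(length):
        probs = softmax(student_logits_fn(seq))
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq


def on_policy_distill_loss(teacher_logits_fn, student_logits_fn,
                           prompt, length, rng, kind="forward_kl"):
    """Average token-level divergence on a student-generated sequence."""
    seq = sample_from_student(student_logits_fn, prompt, length, rng)
    total = 0.0
    for t in range(len(prompt), len(seq)):
        prefix = seq[:t]  # the teacher gives feedback on the student's own prefix
        total += token_divergence(softmax(teacher_logits_fn(prefix)),
                                  softmax(student_logits_fn(prefix)), kind)
    return total / length


def toy_teacher(prefix):
    """Hypothetical teacher: strongly prefers (last token + 1) mod VOCAB."""
    last = prefix[-1]
    return [3.0 if tok == (last + 1) % VOCAB else 0.0 for tok in range(VOCAB)]


def toy_student(prefix):
    """Hypothetical student: same preference, but much less confident."""
    last = prefix[-1]
    return [1.0 if tok == (last + 1) % VOCAB else 0.0 for tok in range(VOCAB)]


if __name__ == "__main__":
    for kind in ("forward_kl", "reverse_kl"):
        loss = on_policy_distill_loss(toy_teacher, toy_student, [0], 8,
                                      random.Random(0), kind)
        print(kind, loss)
```

In a real training loop the student would be a neural network and this loss would be minimized by gradient descent on the student's parameters. The sketch only shows where the training sequences come from (the student) and where the supervision comes from (the teacher's distribution at each student-visited prefix).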