Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive T5 language models on summarization, translation, and arithmetic reasoning tasks as well as task-agnostic instruction tuning.
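The on-policy idea described above can be illustrated with a minimal sketch: sample an output sequence from the student, then score each sampled position by a divergence between the teacher's and the student's next-token distributions. Everything named here is a hypothetical stand-in rather than the authors' implementation: `toy_teacher` and `toy_student` are hand-written next-token functions over a 3-token vocabulary, and forward and reverse KL are used only as two examples of interchangeable loss functions. A real setup would presumably compute the same quantity from model logits in an autodiff framework and use it to update the student.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher_probs, student_probs):
    """KL(teacher || student): the usual supervised-distillation direction."""
    return kl(teacher_probs, student_probs)


def reverse_kl(teacher_probs, student_probs):
    """KL(student || teacher): one possible alternative loss (illustrative choice)."""
    return kl(student_probs, teacher_probs)


def sample_sequence(next_token_probs, prompt, length, rng):
    """Autoregressively sample `length` tokens after `prompt`."""
    seq = list(prompt)
    for _ in range(length):
        probs = next_token_probs(seq)
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq


def on_policy_distill_loss(student, teacher, prompt, length, divergence, rng):
    """Mean per-token divergence on a sequence sampled from the student itself."""
    seq = sample_sequence(student, prompt, length, rng)
    total = 0.0
    for t in range(len(prompt), len(seq)):
        prefix = seq[:t]
        # The teacher gives feedback on the student's own prefix.
        total += divergence(teacher(prefix), student(prefix))
    return total / length


def toy_teacher(prefix):
    """Hypothetical teacher: strongly prefers (last token + 1) mod 3."""
    last = prefix[-1] if prefix else 0
    probs = [0.1, 0.1, 0.1]
    probs[(last + 1) % 3] = 0.8
    return probs


def toy_student(prefix):
    """Hypothetical untrained student: uniform over the 3-token vocabulary."""
    return [1 / 3, 1 / 3, 1 / 3]


if __name__ == "__main__":
    rng = random.Random(0)
    for name, div in [("forward KL", forward_kl), ("reverse KL", reverse_kl)]:
        loss = on_policy_distill_loss(toy_student, toy_teacher, [0], 5, div, rng)
        print(f"{name}: {loss:.4f}")
```

Passing the divergence as an argument mirrors the flexibility the abstract describes: the sampling loop stays the same while the loss between student and teacher is swapped out.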