Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the need to compress ever-growing language models. In this work, we propose f-DISTILL, a framework that formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that the existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive a step-wise decomposition for f-DISTILL, reducing the intractable sequence-level divergence to word-level losses that can be computed tractably. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses better force the student to learn from the teacher distribution.
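To make the idea of word-level losses concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not name the four variants, so the sketch assumes four standard f-divergences (KL, reverse KL, Jensen-Shannon, and total variation distance). It also assumes the teacher and student next-word distributions at each decoding step have already been computed on a shared prefix; how prefixes are sampled is not covered here. All function and variable names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation choice, not from the source


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def js(p, q):
    """Jensen-Shannon divergence: symmetric in p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def tvd(p, q):
    """Total variation distance: symmetric in p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Assumed set of variants; p is the teacher distribution, q the student's.
DIVERGENCES = {
    "kl": kl,                      # teacher || student
    "rkl": lambda p, q: kl(q, p),  # student || teacher
    "js": js,
    "tvd": tvd,
}


def stepwise_loss(teacher_steps, student_steps, name):
    """Sum a word-level divergence over decoding steps.

    teacher_steps / student_steps: one probability vector per step,
    both conditioned on the same prefix (assumed to be given).
    """
    div = DIVERGENCES[name]
    return sum(div(p, q) for p, q in zip(teacher_steps, student_steps))


# Toy example: two decoding steps over a three-word vocabulary.
teacher_steps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_steps = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]

for name in DIVERGENCES:
    print(name, stepwise_loss(teacher_steps, student_steps, name))
```

In this sketch the JS and TVD losses give the same value when teacher and student are swapped, whereas KL and reverse KL generally do not, which is the sense in which the former two are symmetric.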