Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatch has significantly increased computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to use student-generated outputs more efficiently. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.
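To make the first component concrete, here is a minimal sketch of a skew KL divergence between a teacher and a student next-token distribution. It assumes the standard $\alpha$-skew definition, $D_{\mathrm{KL}}(p \,\|\, \alpha p + (1-\alpha) q)$; the abstract does not state the formula, so this definition, the value of `alpha`, and the function names are illustrative assumptions rather than the exact loss used in DistiLLM.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def skew_kl(p, q, alpha=0.1):
    """Skew KL: KL(p || alpha * p + (1 - alpha) * q).

    alpha is a hypothetical default chosen for illustration only.
    Mixing p into the second argument keeps the divergence finite
    even where q assigns zero probability to a token that p supports.
    """
    mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mix)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
teacher = [0.7, 0.2, 0.1]
student = [0.0, 0.4, 0.6]  # student gives zero mass to the teacher's top token

print(skew_kl(teacher, student, alpha=0.1))
```

With `alpha = 0` the function reduces to the ordinary KL divergence, which would be undefined for the toy `student` above. For any `alpha > 0` the mixture is at least `alpha * p`, so the value is bounded above by `-log(alpha)`.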