Despite the impressive capabilities of large language models (LLMs) across diverse applications, they still impose substantial computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a high-performing LLM (i.e., the teacher model). Prevailing LLM distillation techniques either use a black-box model API to generate high-quality pretraining and alignment datasets, or perform white-box distillation by altering the loss function to better transfer knowledge from the teacher LLM. However, these methods ignore the knowledge gap between the student and teacher LLMs across domains, leading to excessive focus on domains with small performance gaps and insufficient attention to domains with large gaps, which reduces overall performance. In this paper, we introduce DDK, a new LLM distillation framework that dynamically adjusts the composition of the distillation dataset in a smooth manner according to the domain-level performance differences between the teacher and student models, making the distillation process more stable and effective. Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
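To make the core idea concrete, the following is a minimal sketch of domain-aware data reweighting in the spirit described above. It assumes per-domain loss (or error) measurements for both models are available; the softmax temperature, the `smooth` mixing factor, and all function names are illustrative assumptions, not the paper's actual formulation. Domains where the student lags the teacher most receive higher sampling weight, and blending with a uniform distribution keeps the update smooth so no domain is starved.

```python
import math
import random

def update_domain_weights(teacher_loss, student_loss, smooth=0.5, temperature=1.0):
    """Compute per-domain sampling weights from teacher-student performance gaps.

    teacher_loss / student_loss: dicts mapping domain name -> held-out loss.
    smooth: mixes the gap-driven distribution with a uniform one for stability.
    temperature: controls how sharply weight concentrates on high-gap domains.
    (Illustrative sketch; not the exact DDK update rule.)
    """
    domains = list(teacher_loss)
    # Discrepancy: how much worse the student is than the teacher per domain.
    gaps = {d: max(student_loss[d] - teacher_loss[d], 0.0) for d in domains}
    # Softmax over gaps -> a probability distribution favoring weak domains.
    exp = {d: math.exp(gaps[d] / temperature) for d in domains}
    z = sum(exp.values())
    soft = {d: exp[d] / z for d in domains}
    # Smooth update: blend with uniform so every domain keeps some coverage.
    uniform = 1.0 / len(domains)
    return {d: smooth * soft[d] + (1.0 - smooth) * uniform for d in domains}

def sample_domain(weights, rng=random):
    """Draw the next training domain according to the current weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[d] for d in names], k=1)[0]
```

During distillation, such weights would be recomputed periodically on held-out domain data and used to resample the next batch of distillation examples, shifting capacity toward domains where the student most needs the teacher's knowledge.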