In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges when distilling LLMs, including inefficiency and the insufficient measurement capability of traditional KL divergence. We show that an LLM can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and the implicit reward function. We reformulate KD of LLMs into two stages: first optimizing an objective consisting of the implicit reward and reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs. We conduct experiments and analysis on various datasets with LLM parameters ranging from 120M to 13B, demonstrating the broad applicability and effectiveness of our DPKD approach. We also show the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact-match percentage. Code and data are available at https://aka.ms/dpkd.
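To make the two-stage formulation concrete, the following is a minimal, hypothetical sketch of a DPO-style preference loss in which the implicit reward of a response is the log-probability ratio between the student policy and a reference model, and the loss pushes the preference probability of the teacher's output above the student's own output. The function name, argument names, and `beta` value are illustrative assumptions, not the paper's actual implementation.

```python
import math

def dpkd_preference_loss(logp_student_on_teacher_out: float,
                         logp_student_on_student_out: float,
                         logp_ref_on_teacher_out: float,
                         logp_ref_on_student_out: float,
                         beta: float = 0.1) -> float:
    """DPO-style preference loss sketch (hypothetical signature).

    Implicit reward of a response y: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    The loss is -log sigmoid(reward_teacher_out - reward_student_out), which is
    minimized when the student assigns a higher implicit reward to the
    teacher's output than to its own output.
    """
    r_teacher = beta * (logp_student_on_teacher_out - logp_ref_on_teacher_out)
    r_student = beta * (logp_student_on_student_out - logp_ref_on_student_out)
    margin = r_teacher - r_student
    # Numerically plain logistic loss on the reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss equals ln 2, and it decreases monotonically as the student's implicit reward for the teacher's output grows relative to its own output.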