Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models such as LLMs remains relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, is unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning, for the first time in KD, to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model to extract student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge effectively alleviates exposure bias throughout the entire training process, leading to performance enhancements.
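The core mechanism, tuning only a small set of prompt parameters on a frozen teacher so its output distribution becomes more student-friendly, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the "prompt" here is a per-logit shift on a frozen 4-token teacher rather than soft prompt embeddings prepended to the input, and the gradient is taken by finite differences instead of backpropagation. All names and values below are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Frozen teacher logits over a toy 4-token vocabulary, and a fixed student.
teacher_base = [2.0, 0.5, -1.0, 0.2]
student_logits = [1.0, 1.2, -0.5, 0.1]

# The only trainable parameters: a "prompt" modeled as an additive shift on
# the frozen teacher's logits (a stand-in for prepended prompt tokens).
prompt = [0.0] * 4

def loss(prompt):
    # Distillation objective: divergence between the prompted teacher's
    # distribution and the student's distribution.
    shifted = [t + p for t, p in zip(teacher_base, prompt)]
    return kl(softmax(shifted), softmax(student_logits))

# Tune only the prompt (teacher and student stay frozen) with
# finite-difference gradient descent.
lr, eps = 0.5, 1e-5
before = loss(prompt)
for _ in range(200):
    grads = []
    for i in range(len(prompt)):
        bumped = prompt[:]
        bumped[i] += eps
        grads.append((loss(bumped) - loss(prompt)) / eps)
    prompt = [p - lr * g for p, g in zip(prompt, grads)]
after = loss(prompt)
print(before, after)  # the divergence shrinks as the prompt adapts
```

The point of the sketch is the parameter economy: the distillation loss is driven to a lower value by updating only the prompt, never the teacher's own weights, which mirrors why PromptKD needs only a tiny fraction of extra parameters.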