Modern Natural Language Generation (NLG) models come with massive computational and storage requirements. In this work, we study the potential of compressing them, which is crucial for real-world applications serving millions of users. We focus on Knowledge Distillation (KD) techniques, in which a small student model learns to imitate a large teacher model, thereby transferring knowledge from the teacher to the student. In contrast to much of the previous work, our goal is to optimize the model for a specific NLG task and a specific dataset. In real-world applications, abundant unlabeled task-specific data is typically available in addition to labeled data, and this unlabeled data is crucial for attaining high compression rates via KD. We conduct a systematic study of task-specific KD techniques for various NLG tasks under realistic assumptions. We discuss the special characteristics of NLG distillation, particularly the exposure bias problem. We then derive a family of Pseudo-Target (PT) augmentation methods, substantially extending prior work on sequence-level KD. We propose the Joint-Teaching method, which applies word-level KD to multiple PTs generated by both the teacher and the student. Finally, we validate our findings in an extreme setup with no labeled examples, using GPT-4 as the teacher. Our study provides practical model design observations and demonstrates the effectiveness of PT training for task-specific KD in NLG.
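As an illustration of the word-level KD objective mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes that word-level KD means a per-token KL divergence between the teacher's and the student's next-token distributions, and that each pseudo-target has already been scored by both models, so that it is represented as a pair of logit matrices (one row per token, one column per vocabulary entry). The function names `word_level_kd_loss` and `joint_teaching_loss`, the `temperature` parameter, and the uniform averaging over pseudo-targets are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

Logits = Sequence[Sequence[float]]  # shape: [num_tokens][vocab_size]


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def word_level_kd_loss(
    teacher_logits: Logits, student_logits: Logits, temperature: float = 1.0
) -> float:
    """Mean per-token KL(teacher || student) over one target sequence.

    Assumption: both logit matrices were obtained by scoring the SAME
    target sequence (for example a pseudo-target) with each model.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must score the same tokens")
    if not teacher_logits:
        raise ValueError("the target sequence must contain at least one token")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)  # teacher distribution
        log_q = log_softmax(s_row, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


def joint_teaching_loss(
    scored_pseudo_targets: Sequence[Tuple[Logits, Logits]],
    temperature: float = 1.0,
) -> float:
    """Average the word-level KD loss over several pseudo-targets.

    Assumption: the list mixes pseudo-targets generated by the teacher and
    by the student; each entry is (teacher_logits, student_logits).
    """
    if not scored_pseudo_targets:
        raise ValueError("at least one pseudo-target is required")
    losses = [
        word_level_kd_loss(t, s, temperature) for t, s in scored_pseudo_targets
    ]
    return sum(losses) / len(losses)
```

In a real training loop the logits would come from teacher-forced forward passes of both models over each pseudo-target, and the loss would be computed with an automatic-differentiation framework rather than plain Python lists; the sketch only shows the arithmetic of the objective.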