Modern Natural Language Generation (NLG) models come with massive computational and storage requirements. In this work, we study the potential of compressing them, which is crucial for real-world applications serving millions of users. We focus on Knowledge Distillation (KD) techniques, in which a small student model learns to imitate a large teacher model, thereby transferring knowledge from the teacher to the student. In contrast to much of the previous work, our goal is to optimize the model for a specific NLG task and a specific dataset. In real-world applications, labeled data is typically accompanied by abundant unlabeled task-specific data, which is crucial for attaining high compression rates via KD. We conduct a systematic study of task-specific KD techniques for various NLG tasks under realistic assumptions. We discuss the special characteristics of NLG distillation, particularly the exposure bias problem. We then derive a family of Pseudo-Target (PT) augmentation methods, substantially extending prior work on sequence-level KD. We propose the Joint-Teaching method for NLG distillation, which applies word-level KD to multiple PTs generated by both the teacher and the student. Our study provides practical model design observations and demonstrates the effectiveness of PT training for task-specific KD in NLG.
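To make the notion of word-level KD concrete, the following is a minimal sketch of one common formulation: the per-token KL divergence between the teacher's and the student's next-token distributions, averaged over a sequence. The abstract does not specify the loss, so the choice of KL direction, the absence of a temperature, and the names `log_softmax` and `word_level_kd_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def word_level_kd_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) for one sequence.

    Both arguments are lists of per-position logit lists, assumed to be
    computed on the same target prefix (for example, a pseudo-target).
    This is a hypothetical helper, not the paper's exact objective.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("expected two non-empty sequences of equal length")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row)  # teacher distribution (the target)
        log_q = log_softmax(s_row)  # student distribution (being trained)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy example: two target positions, vocabulary of three tokens.
teacher = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
student = [[3.0, 2.0, 1.0], [0.5, 0.5, 0.5]]
loss = word_level_kd_loss(teacher, student)
```

Under this sketch, training on several PTs per input would amount to evaluating the same loss on each teacher-generated or student-generated target and averaging the results; how those targets are mixed is left open here.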