Data-Free Knowledge Distillation (DFKD) plays a vital role in compressing models when the original training data is unavailable. Previous DFKD work in NLP has mainly focused on distilling encoder-only architectures such as BERT on classification tasks, overlooking the notable progress in generative language modeling. In this work, we propose a novel DFKD framework, DFKD-T$^{3}$, in which the pretrained generative language model also serves as a controllable data generator for model compression. DFKD-T$^{3}$ is an end-to-end learnable text-to-text framework that transforms a general-domain corpus into compression-friendly task data, aiming to improve both the \textit{specificity} and the \textit{diversity} of the generated data. Extensive experiments show that our method improves distillation performance on various downstream tasks, including sentiment analysis, linguistic acceptability, and information extraction. Furthermore, we show that the generated texts can be used directly to distill other language models and outperform SOTA methods, making our method more appealing in a general DFKD setting. Our code is available at https://gitee.com/mindspore/models/tree/master/research/nlp/DFKD\_T3.