Pre-trained speech language models such as HuBERT and WavLM leverage unlabeled speech data for self-supervised learning and offer powerful representations for numerous downstream tasks. Despite their success, these models' high memory and compute requirements hinder their deployment on resource-constrained devices. This paper therefore introduces GenDistiller, a novel knowledge distillation framework in which a much smaller student network directly generates the hidden representations of the pre-trained teacher model. The proposed method takes the previous hidden layer as history and predicts the teacher model's layers one by one, autoregressively. Experiments on SUPERB show the advantage of GenDistiller over a baseline distillation method without an autoregressive framework, with 33% fewer parameters, similar time consumption, and better performance on most SUPERB tasks. Overall, GenDistiller reduces the size of WavLM by 82%.
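The layer-by-layer autoregressive prediction described above can be illustrated with a toy sketch. This is not the paper's implementation: the linear maps, the weight sharing across steps, the mean-squared-error loss, and the choice to feed the student's own previous prediction back as history are all assumptions made for illustration, and every name (`generate_layers`, `distillation_loss`, `w_feat`, `w_hist`) is hypothetical.

```python
import random


def make_linear(dim, rng):
    """Return a random dim x dim matrix (toy stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_linear(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def generate_layers(features, w_feat, w_hist, num_layers):
    """Autoregressively predict num_layers hidden vectors.

    Each step conditions on the input features and on the previously
    generated layer (the "history"). Assumption: the same weights are
    reused at every step, and history starts as a zero vector.
    """
    history = [0.0] * len(features)
    predictions = []
    for _ in range(num_layers):
        from_features = apply_linear(w_feat, features)
        from_history = apply_linear(w_hist, history)
        history = [a + b for a, b in zip(from_features, from_history)]
        predictions.append(history)
    return predictions


def distillation_loss(predictions, teacher_layers):
    """Mean squared error over all layers and dimensions (assumed loss)."""
    total, count = 0.0, 0
    for pred, target in zip(predictions, teacher_layers):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy data: one feature vector and random "teacher" hidden layers.
rng = random.Random(0)
dim, num_layers = 4, 3
features = [rng.uniform(-1, 1) for _ in range(dim)]
teacher_layers = [[rng.uniform(-1, 1) for _ in range(dim)]
                  for _ in range(num_layers)]
w_feat, w_hist = make_linear(dim, rng), make_linear(dim, rng)

predictions = generate_layers(features, w_feat, w_hist, num_layers)
loss = distillation_loss(predictions, teacher_layers)
```

In a real system the student would be a small neural network trained by gradient descent on this kind of loss. The sketch only shows the control flow: one prediction per teacher layer, each conditioned on the one before it.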