Training on massive speech corpora has driven the recent success of many self-supervised speech models. Through knowledge distillation, these models may also benefit from the knowledge encoded by language models pre-trained on rich text sources. The distillation process is challenging, however, because of the modality gap between the textual and speech embedding spaces. This paper studies metric-based distillation, which aligns the text and speech embedding spaces using only a small amount of data and without modifying the model structure. Prior work has largely overlooked the semantic and granularity gaps between text and speech, which impairs distillation. We therefore propose Prior-informed Adaptive knowledge Distillation (PAD), which adaptively leverages text and speech units of variable granularity, together with prior distributions, to achieve better global and local alignment between pre-trained text and speech models. We evaluate on three spoken language understanding benchmarks and show that PAD transfers linguistic knowledge more effectively than other metric-based distillation approaches.
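To make the idea of metric-based alignment concrete, the following is a minimal sketch of a generic global and local alignment objective between a text embedding sequence and a speech embedding sequence. It is not the PAD method, which the abstract does not specify in detail: the function names, the use of cosine distance, the mean pooling, and the nearest-frame matching are all illustrative assumptions, and the sketch further assumes that both models already produce embeddings of the same dimensionality.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Pairwise cosine distance between rows of a (m, d) and rows of b (n, d)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - a @ b.T

def global_alignment_loss(speech_frames, text_tokens):
    """Hypothetical global term: distance between mean-pooled embeddings."""
    s = speech_frames.mean(axis=0, keepdims=True)  # (1, d) utterance vector
    t = text_tokens.mean(axis=0, keepdims=True)    # (1, d) sentence vector
    return float(cosine_distance(s, t)[0, 0])

def local_alignment_loss(speech_frames, text_tokens):
    """Hypothetical local term: each text token matched to its nearest frame."""
    dist = cosine_distance(text_tokens, speech_frames)  # (n_tokens, n_frames)
    return float(dist.min(axis=1).mean())

# Toy embeddings standing in for teacher (text) and student (speech) outputs.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))     # 5 text tokens, dimension 16 (assumed)
speech = rng.normal(size=(40, 16))  # 40 speech frames, same dimension (assumed)

g_loss = global_alignment_loss(speech, text)
l_loss = local_alignment_loss(speech, text)
total = g_loss + l_loss  # equal weighting is an arbitrary choice here
```

The mismatch between 5 text tokens and 40 speech frames in this toy setup illustrates the granularity gap the abstract refers to: a fixed token-to-frame matching rule such as the nearest-frame heuristic above is exactly the kind of rigid choice that an adaptive approach would aim to replace.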