This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. We show that language model (LM) information is distilled into an ASR model more effectively when both the intermediate layers and the final layer are used. Using the intermediate layers as distillation targets lets LM knowledge reach the lower network layers more effectively. Our method achieves better recognition accuracy than shallow fusion of an external LM while maintaining fast parallel decoding. Experiments on the LibriSpeech dataset demonstrate the effectiveness of our approach in enhancing greedy decoding with connectionist temporal classification (CTC).
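The two ingredients named above, distillation from teacher token probabilities at several layers and greedy CTC decoding, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the loss, so a soft cross-entropy against the teacher probabilities is assumed here, and the function names, the layer names and the equal layer weights are all hypothetical. Only the greedy CTC rule (take the best label per frame, merge repeats, drop blanks) is standard.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_probs, student_logits):
    """Cross-entropy between teacher token probabilities and student
    predictions, averaged over token positions (assumed loss form)."""
    loss = 0.0
    for t_row, s_row in zip(teacher_probs, student_logits):
        s_probs = softmax(s_row)
        loss -= sum(t * math.log(s + 1e-12) for t, s in zip(t_row, s_probs))
    return loss / len(teacher_probs)


def distillation_loss(teacher_probs, layer_logits, layer_weights):
    """Weighted sum of per-layer distillation terms.

    layer_logits maps a (hypothetical) layer name to the logits that an
    attention decoder attached to that layer produces for each token.
    """
    return sum(
        layer_weights[name] * soft_cross_entropy(teacher_probs, logits)
        for name, logits in layer_logits.items()
    )


def ctc_greedy_decode(frame_logits, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    decoded = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


if __name__ == "__main__":
    # Toy teacher distribution over a 2-token vocabulary at 2 positions.
    teacher = [[0.9, 0.1], [0.2, 0.8]]
    student = {
        "intermediate": [[1.0, 0.0], [0.0, 1.0]],
        "final": [[3.0, 0.0], [0.0, 3.0]],
    }
    weights = {"intermediate": 0.5, "final": 0.5}  # assumed equal weighting
    print(distillation_loss(teacher, student, weights))

    # Per-frame scores whose argmax sequence is 0 1 1 0 1 2 2 0 (0 = blank).
    frames = [[1.0 if i == k else 0.0 for i in range(3)]
              for k in [0, 1, 1, 0, 1, 2, 2, 0]]
    print(ctc_greedy_decode(frames))
```

Because the per-frame argmax needs no search over hypotheses and no external LM call, every frame can be processed in parallel, which is the property that shallow fusion gives up.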