Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks, and leveraging them to enhance automatic speech recognition (ASR) systems has emerged as a promising research direction. However, previous work may be limited by the inflexible structures of PLMs and by insufficient use of them. To alleviate these problems, we propose hierarchical knowledge distillation (HKD) for continuous integrate-and-fire (CIF) based ASR models. To transfer knowledge from PLMs to the ASR model, HKD applies cross-modal knowledge distillation with a contrastive loss at the acoustic level and knowledge distillation with a regression loss at the linguistic level. Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reduction on the AISHELL-1 and LibriSpeech datasets, respectively.
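As a rough illustration of how the two distillation terms could be combined, the sketch below pairs a contrastive loss with a regression loss. The abstract does not give the exact loss forms, so the InfoNCE-style contrastive term, the mean-squared-error regression term, the temperature, and the weights `w_ac` and `w_lm` are all assumptions made for this example, as are the function names. Features are plain Python lists of vectors, one per token, with student (ASR) vector `i` assumed to align with teacher (PLM) vector `i`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(student, teacher, temperature=0.1):
    """InfoNCE-style loss (assumed form): student[i] should match teacher[i];
    every other teacher[j] serves as a negative."""
    total = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(student)


def regression_loss(student, teacher):
    """Mean squared error (assumed form) between aligned vectors."""
    count = sum(len(s) for s in student)
    squared = sum(
        (a - b) ** 2
        for s, t in zip(student, teacher)
        for a, b in zip(s, t)
    )
    return squared / count


def hkd_loss(asr_loss, acoustic_s, acoustic_t, linguistic_s, linguistic_t,
             w_ac=1.0, w_lm=1.0):
    """Hypothetical combined objective: ASR loss plus weighted distillation
    terms at the acoustic and linguistic levels."""
    return (
        asr_loss
        + w_ac * contrastive_loss(acoustic_s, acoustic_t)
        + w_lm * regression_loss(linguistic_s, linguistic_t)
    )


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]
    print(hkd_loss(1.0, aligned, teacher, aligned, teacher))
```

In this sketch the contrastive term only asks that each student vector be closer to its own teacher vector than to the others, while the regression term pulls the vectors toward the teacher's values directly. Which of the model's representations feed each term is left open here, since the abstract does not say.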