Large-scale pre-trained language models (PLMs) with powerful language modeling capabilities are widely used in natural language processing. Leveraging PLMs to improve automatic speech recognition (ASR) has likewise become a promising research direction. However, most previous approaches are constrained by the fixed sizes and structures of PLMs and make insufficient use of the knowledge they contain. To alleviate these problems, we propose hierarchical knowledge distillation for continuous integrate-and-fire (CIF) based ASR models. Specifically, we transfer knowledge from PLMs to the ASR model at two levels: cross-modal distillation with a contrastive loss at the acoustic level, and distillation with a regression loss at the linguistic level. On the AISHELL-1 dataset, our method achieves a 15% relative error rate reduction over the original CIF-based model and performance comparable to the state-of-the-art model (3.8%/4.1% on dev/test).
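The two distillation levels described in the abstract can be illustrated with a minimal sketch. The abstract names only a contrastive loss at the acoustic level and a regression loss at the linguistic level; everything else here is an assumption made for illustration: an InfoNCE-style form for the contrastive loss, mean squared error for the regression loss, student vectors already projected to the teacher's dimension and aligned one-to-one with the PLM token embeddings, and the hypothetical names `temperature`, `w_acoustic`, and `w_linguistic`. It is not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Fall back to a norm of 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def contrastive_loss(student, teacher, temperature=0.1):
    """InfoNCE-style loss (assumed form): student[i] should be closest to
    teacher[i], with the other teacher vectors acting as negatives."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(s)

def regression_loss(student, teacher):
    """Mean squared error (assumed form) between aligned vectors."""
    count = sum(len(v) for v in student)
    squared = sum((a - b) ** 2
                  for s_v, t_v in zip(student, teacher)
                  for a, b in zip(s_v, t_v))
    return squared / count

def hierarchical_distillation_loss(acoustic, linguistic, teacher,
                                   w_acoustic=1.0, w_linguistic=1.0):
    """Weighted sum of the two levels; the weights are hypothetical."""
    return (w_acoustic * contrastive_loss(acoustic, teacher)
            + w_linguistic * regression_loss(linguistic, teacher))

# Toy example: three tokens with 2-dimensional teacher (PLM) embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
acoustic = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1]]    # acoustic-level student outputs
linguistic = [[0.8, 0.0], [0.1, 0.9], [-0.9, 0.2]]  # linguistic-level student outputs
loss = hierarchical_distillation_loss(acoustic, linguistic, teacher)
```

In a real system this term would presumably be added to the ordinary ASR training loss, and the student vectors would come from the CIF output and the decoder states rather than from hand-written lists.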