Streaming automatic speech recognition (ASR) models cannot access future context, so they perform worse than non-streaming models. To close this gap, knowledge distillation (KD) from a non-streaming model to a streaming model has been studied, mainly by aligning the output token probabilities. In this paper, we propose layer-to-layer KD from the teacher encoder to the student encoder. To ensure that the matched features are extracted from the same context, we insert auxiliary non-streaming branches into the student and perform KD from each non-streaming teacher layer to the corresponding non-streaming auxiliary layer. We design a special KD loss that leverages the autoregressive predictive coding (APC) mechanism to encourage the streaming model to predict unseen future context. Experimental results show that the proposed method significantly reduces the word error rate compared to previous token probability distillation methods.
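The idea of a feature-level distillation loss with an APC-style future shift can be illustrated with a minimal sketch. This is not the loss defined in the paper, whose exact form the abstract does not give. It assumes that student and teacher features are arrays of shape (frames, dim), that the distance is a mean L1 distance (as in the original APC objective), and that the student feature at frame t is compared with the teacher feature at frame t + shift. The function name `shifted_distillation_loss` and the parameter `shift` are hypothetical.

```python
import numpy as np


def shifted_distillation_loss(student, teacher, shift=0):
    """Mean L1 distance between student[t] and teacher[t + shift].

    Hypothetical illustration only: with shift > 0 the student is pushed
    to match teacher features from `shift` frames in the future, in the
    spirit of autoregressive predictive coding (APC).
    """
    student = np.asarray(student, dtype=float)
    teacher = np.asarray(teacher, dtype=float)
    if student.shape != teacher.shape:
        raise ValueError("student and teacher must have the same shape")
    num_frames = student.shape[0]
    if not 0 <= shift < num_frames:
        raise ValueError("shift must satisfy 0 <= shift < number of frames")
    # Drop the last `shift` student frames, which have no future target.
    valid = num_frames - shift
    return float(np.mean(np.abs(student[:valid] - teacher[shift:])))


# Toy features: 6 frames, 2 dimensions.
teacher = np.arange(12.0).reshape(6, 2)
# A student whose frame t already equals teacher frame t + 1.
student = np.roll(teacher, -1, axis=0)
loss_future = shifted_distillation_loss(student, teacher, shift=1)
loss_same = shifted_distillation_loss(teacher, teacher, shift=0)
```

With `shift=0` this reduces to plain frame-aligned feature matching. A positive shift is one simple way to express "predict unseen future context" as a training target, under the assumptions stated above.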