Wav2vec 2.0 (W2V2) has shown impressive performance in automatic speech recognition (ASR). However, its large model size and non-streaming architecture make it hard to use in low-resource or streaming scenarios. In this work, we propose a two-stage knowledge distillation method to solve these two problems: the first stage makes the large, non-streaming teacher model smaller, and the second stage makes it streaming. Specifically, we adopt the MSE loss for the distillation of hidden layers and a modified LF-MMI loss for the distillation of the prediction layer. Experiments are conducted on Gigaspeech, Librispeech, and an in-house dataset. The results show that the final distilled student model (DistillW2V2) is 8x faster and 12x smaller than the original teacher model. For the 480ms latency setup, DistillW2V2's relative word error rate (WER) degradation ranges from 9% to 23.4% on the test sets, which suggests a promising way to extend W2V2's application scope.
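To make the hidden-layer distillation term concrete, the following is a minimal sketch of an MSE loss between selected student and teacher hidden layers. It is an illustration under stated assumptions, not the implementation used in this work: the names `layer_mse`, `hidden_mse_loss`, and `layer_map` are hypothetical, the student-to-teacher layer mapping is chosen arbitrarily, and the student and teacher are assumed to share the same hidden dimension (otherwise a projection would be needed). The modified LF-MMI term for the prediction layer is not shown.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]   # hidden vector at one time step
Layer = Sequence[Frame]   # hidden states of one layer, indexed [time][dim]


def layer_mse(student: Layer, teacher: Layer) -> float:
    """Mean squared error over all time steps and dimensions of one layer."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same number of frames")
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student, teacher):
        if len(s_frame) != len(t_frame):
            raise ValueError("assumed equal hidden dimension (no projection)")
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def hidden_mse_loss(
    student_layers: Sequence[Layer],
    teacher_layers: Sequence[Layer],
    layer_map: List[Tuple[int, int]],
) -> float:
    """Average the per-layer MSE over (student_index, teacher_index) pairs.

    layer_map is a hypothetical mapping; the source does not specify
    which teacher layers supervise which student layers.
    """
    losses = [layer_mse(student_layers[s], teacher_layers[t]) for s, t in layer_map]
    return sum(losses) / len(losses)


# Toy data: a 4-layer "teacher" and a 2-layer "student", 2 frames, dimension 2.
teacher = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
student = [
    [[1.0, 2.0], [3.0, 2.0]],  # differs from teacher layer 1 in one element
    [[2.0, 2.0], [2.0, 2.0]],  # matches teacher layer 3 exactly
]

# Assumed mapping: student layer 0 -> teacher layer 1, student layer 1 -> teacher layer 3.
loss = hidden_mse_loss(student, teacher, [(0, 1), (1, 3)])
print(loss)  # 0.5
```

In the toy data, the first mapped pair differs in a single element by 2, giving a per-layer MSE of 4/4 = 1.0, and the second pair matches exactly, so the averaged loss is 0.5. In practice such a term would be computed on framework tensors and combined with the prediction-layer loss.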