Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose a shallow cascaded model that combines several model compression techniques, namely knowledge distillation, a shared decoder, and a tied-and-reduced transducer network, to reduce the model footprint. The shared decoder is converted into a tied-and-reduced network. The cascaded two-pass model is further compressed through knowledge distillation with a Kullback-Leibler divergence loss on the model posteriors. We demonstrate a 50% reduction in the size of a 41 M parameter cascaded teacher model with no noticeable degradation in ASR accuracy and a 30% reduction in latency.
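
To make the distillation objective concrete, the following is a minimal sketch of a Kullback-Leibler divergence loss between teacher and student posteriors. It is an illustration under stated assumptions, not the implementation used in the paper: the abstract does not specify the direction of the divergence, any temperature scaling, or how the transducer output lattice is handled. The sketch assumes KL(teacher || student), averages over a flat list of output positions, and uses hypothetical names (`log_softmax`, `kl_distillation_loss`).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over output positions.

    Assumption: each argument is a list of per-position logit lists over the
    same vocabulary. A real transducer would produce a (time, label, vocab)
    lattice; it is flattened to a list of positions here for simplicity.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        # KL(p || q) = sum_k p_k * (log p_k - log q_k)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)


# Toy example: two output positions, vocabulary of three symbols.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[1.5, 0.7, -0.5], [0.2, 0.8, 0.1]]
loss = kl_distillation_loss(teacher, student)
```

In training, a term of this kind would typically be added to the student's regular ASR loss with a weighting factor; that weighting is likewise an assumption here and is not stated in the abstract.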