Neural Transducers and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition approaches. Because both are frame-synchronous, they introduce blank symbols to handle the length mismatch between acoustic frames and output tokens, which can cause redundant computation. Previous studies accelerated the training and inference of neural Transducers by discarding frames that a co-trained CTC predicts as blank. However, nothing guarantees that the co-trained CTC maximizes the ratio of blank symbols. This paper proposes two novel regularization methods that explicitly encourage more blanks by constraining the self-loops of non-blank symbols in the CTC. We find that the frame reduction ratio of the neural Transducer can approach the theoretical boundary. Experiments on the LibriSpeech corpus show that the proposed method accelerates neural Transducer inference by a factor of 4 without sacrificing performance. Our work is open-sourced and publicly available at https://github.com/k2-fsa/icefall.
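The frame-discarding idea described above can be illustrated with a minimal sketch. It assumes per-frame CTC log-posteriors are available alongside the encoder output, and it drops every frame whose blank posterior exceeds a threshold. The function names, the `blank_id=0` convention, and the `threshold=0.9` value are assumptions made for illustration; this is not the icefall implementation, and it does not include the proposed regularization methods.

```python
import numpy as np


def drop_blank_frames(encoder_out, ctc_log_probs, blank_id=0, threshold=0.9):
    """Keep only frames whose CTC blank posterior is below `threshold`.

    encoder_out:   array of shape (T, D), one encoder vector per frame
    ctc_log_probs: array of shape (T, V), log-posteriors from the CTC head
    Returns the retained frames and the boolean keep-mask of shape (T,).
    """
    blank_prob = np.exp(ctc_log_probs[:, blank_id])
    keep = blank_prob < threshold
    return encoder_out[keep], keep


def frame_reduction_ratio(keep):
    """Fraction of frames that were discarded."""
    return float(1.0 - keep.mean())


# Toy example: 6 frames, vocabulary of 3 symbols (index 0 is blank).
blank = np.array([0.99, 0.10, 0.95, 0.98, 0.20, 0.97])
probs = np.stack([blank, (1 - blank) / 2, (1 - blank) / 2], axis=1)
log_probs = np.log(probs)
encoder_out = np.arange(6 * 4, dtype=np.float32).reshape(6, 4)

kept, keep = drop_blank_frames(encoder_out, log_probs)
ratio = frame_reduction_ratio(keep)
print(kept.shape, ratio)
```

In this toy input only frames 1 and 4 fall below the blank threshold, so the downstream Transducer decoder would process two frames instead of six. A CTC that emits more blanks allows more frames to be skipped, which is the quantity the regularization in the abstract aims to increase.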