Streaming automatic speech recognition (ASR) is essential to many real-world ASR applications. A central challenge for streaming ASR systems, however, lies in balancing recognition performance against latency constraints. Recently, CUSIDE (Chunking, Simulating Future Context and Decoding) was proposed for connectionist temporal classification (CTC) based streaming ASR, achieving a good trade-off between low latency and high recognition accuracy. In this paper, we present CUSIDE-T, which successfully adapts the CUSIDE method to the recurrent neural network transducer (RNN-T) ASR architecture instead of the CTC architecture. We also incorporate language model rescoring into CUSIDE-T to further enhance accuracy, at the cost of only a small additional latency. Extensive experiments are conducted on the AISHELL-1, WenetSpeech and SpeechIO datasets, comparing CUSIDE-T with U2++, an existing chunk-based streaming ASR method, with both systems built on RNN-T. The results show that CUSIDE-T achieves superior recognition accuracy for streaming ASR under equal latency settings.
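The chunking-with-simulated-future-context idea can be illustrated with a minimal sketch. In CUSIDE the future context is produced by a learned simulation network; the frame-repetition placeholder below is a hypothetical stand-in used only to show the streaming data flow (function names here are illustrative, not from the paper's code):

```python
def simulate_future_context(chunk, ctx_len):
    # Hypothetical stand-in for CUSIDE's learned context-simulation
    # network: naively repeat the last frame as "future" right context.
    return [chunk[-1]] * ctx_len

def stream_chunks(frames, chunk_len, right_ctx):
    """Split a frame sequence into chunks and append simulated right
    context, so each chunk can be decoded without waiting for real
    future frames -- the key to keeping latency low."""
    for start in range(0, len(frames), chunk_len):
        chunk = frames[start:start + chunk_len]
        yield chunk + simulate_future_context(chunk, right_ctx)

# Example: a 10-frame utterance streamed in chunks of 4 frames,
# each padded with 2 simulated right-context frames.
utterance = list(range(10))
chunks = list(stream_chunks(utterance, chunk_len=4, right_ctx=2))
```

Because the right context is simulated rather than waited for, the per-chunk latency is bounded by the chunk length alone, while the encoder still sees (approximate) future information.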