In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt FastConformer for streaming applications by: (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a large-scale multi-domain dataset, and demonstrate that it achieves better accuracy with lower latency and inference time than a conventional buffered streaming baseline. We also show that training a model with multiple latencies achieves better accuracy than single-latency models while enabling a single model to support multiple latencies. Our experiments further show that the hybrid architecture not only speeds up convergence of the CTC decoder but also improves the accuracy of streaming models compared to single-decoder models.
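The activation caching idea in point (2) can be illustrated with a minimal sketch: each streaming step concatenates a rolling cache of past activations with the incoming chunk, so the encoder sees the same limited left context at inference as it did during training. The class name, shapes, and cache policy below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


class StreamingCache:
    """Rolling cache of past encoder activations (illustrative sketch;
    the name and API are hypothetical, not the paper's actual code)."""

    def __init__(self, left_frames: int, dim: int):
        self.left_frames = left_frames       # limited past context size
        self.cache = np.zeros((0, dim))      # empty history at stream start

    def step(self, chunk: np.ndarray) -> np.ndarray:
        # Prepend cached past activations so the current chunk is
        # processed with the same left context seen at train time.
        context = np.concatenate([self.cache, chunk], axis=0)
        # Keep only the most recent `left_frames` activations for
        # the next step, bounding memory and compute per chunk.
        self.cache = context[-self.left_frames:]
        return context
```

With `left_frames=4` and 3-frame chunks, the first step sees 3 frames, the second sees 6 (3 cached + 3 new), and every later step sees 7 (4 cached + 3 new), so per-step cost stays constant regardless of stream length.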