In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt the FastConformer architecture for streaming applications by (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a large-scale multi-domain dataset and demonstrate that it achieves better accuracy with lower latency and inference time than a conventional buffered streaming baseline. We also show that training a model with multiple latencies can achieve better accuracy than single-latency models while allowing a single model to support multiple latencies. Our experiments further show that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models compared to single-decoder models.
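To make the activation caching idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single causal 1-D convolution with no look-ahead, and the names `causal_conv`, `causal_conv_streaming`, and `run_streaming` are hypothetical. It illustrates why caching can remove the mismatch between offline and streaming computation: if each chunk is processed together with the last `k - 1` cached inputs from earlier chunks, the streaming output matches the full-sequence output exactly.

```python
from typing import List, Tuple


def causal_conv(x: List[float], kernel: List[float]) -> List[float]:
    """Offline pass: output t depends only on x[t-k+1 .. t] (left zero-padding)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def causal_conv_streaming(
    chunk: List[float], kernel: List[float], cache: List[float]
) -> Tuple[List[float], List[float]]:
    """Streaming pass over one chunk.

    `cache` holds the last k-1 inputs seen so far (zeros before the first chunk).
    Returns the chunk's outputs and the updated cache for the next chunk.
    """
    k = len(kernel)
    padded = list(cache) + list(chunk)
    out = [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(chunk))]
    # Keep only the most recent k-1 inputs; older ones are never needed again.
    new_cache = padded[len(padded) - (k - 1):]
    return out, new_cache


def run_streaming(x: List[float], kernel: List[float], chunk_size: int) -> List[float]:
    """Feed x chunk by chunk, carrying the cache across chunks."""
    cache = [0.0] * (len(kernel) - 1)
    outputs: List[float] = []
    for start in range(0, len(x), chunk_size):
        out, cache = causal_conv_streaming(x[start:start + chunk_size], kernel, cache)
        outputs.extend(out)
    return outputs


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    weights = [0.5, 0.25, 1.0]
    # Streaming with any chunk size reproduces the offline result.
    print(run_streaming(signal, weights, chunk_size=3) == causal_conv(signal, weights))
```

In a full encoder the same principle would presumably apply per layer, with self-attention layers caching a bounded window of past activations rather than raw inputs; that extension is an assumption here and is not shown.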