In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt the FastConformer architecture for streaming applications by (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy gap between training and inference that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a large-scale multi-domain dataset and demonstrate that it achieves better accuracy with lower latency and inference time than a conventional buffered streaming baseline. We also show that training a model with multiple latencies can achieve better accuracy than single-latency models while allowing a single model to support multiple latencies. Our experiments further show that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models compared to single-decoder models.
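To make the activation caching idea concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a single-channel causal 1-D convolution as a stand-in for an encoder layer; the abstract does not specify which layers are cached, and the function names (`causal_conv_offline`, `causal_conv_streaming_step`, `run_streaming`) are hypothetical. The point it illustrates is that, when the past context is limited and the needed past inputs are cached between chunks, chunk-by-chunk inference can reproduce the full-sequence output exactly, so there is no mismatch between the two modes.

```python
from typing import List, Tuple


def causal_conv_offline(x: List[float], kernel: List[float]) -> List[float]:
    """Causal 1-D convolution over the whole sequence (left zero-padding)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def causal_conv_streaming_step(
    chunk: List[float], cache: List[float], kernel: List[float]
) -> Tuple[List[float], List[float]]:
    """Process one chunk using cached past inputs (assumed cache size: k - 1)."""
    k = len(kernel)
    # The cache replaces the left padding: it holds the last k - 1 inputs seen.
    padded = list(cache) + list(chunk)
    out = [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(chunk))
    ]
    # Keep only the most recent k - 1 inputs for the next chunk.
    new_cache = padded[len(padded) - (k - 1):]
    return out, new_cache


def run_streaming(
    x: List[float], kernel: List[float], chunk_size: int
) -> List[float]:
    """Feed the sequence chunk by chunk, carrying the cache across chunks."""
    cache = [0.0] * (len(kernel) - 1)  # empty history at stream start
    outputs: List[float] = []
    for start in range(0, len(x), chunk_size):
        out, cache = causal_conv_streaming_step(
            x[start:start + chunk_size], cache, kernel
        )
        outputs.extend(out)
    return outputs


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    weights = [0.5, 0.25, 1.0]
    # Streaming and offline outputs should match for any chunk size.
    print(run_streaming(signal, weights, 3) == causal_conv_offline(signal, weights))
```

Each chunk touches only its own frames plus a fixed-size cache, so the per-step cost does not grow with the length of the stream, and no past frame is recomputed as it would be when re-encoding an overlapping buffer.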