End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes, streaming and non-streaming, each with its own pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire utterance; practitioners may therefore need either mode depending on the application. In this work, we present a joint optimization of streaming and non-streaming ASR based on a multi-decoder architecture and knowledge distillation. Specifically, we study 1) encoder integration of these ASR modules, 2) separate decoders that make mode switching flexible, and 3) similarity-preserving knowledge distillation between the two modular encoders and decoders to further enhance performance. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERRs) on CSJ for streaming ASR and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model, compared to multiple standalone modules.
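As a minimal sketch of the similarity-preserving distillation idea mentioned above (following the generic Tung and Mori formulation, not necessarily the paper's exact loss), the loss below matches the pairwise batch-similarity structure of two sets of encoder features, e.g. streaming vs. non-streaming encoder outputs; all function and variable names here are illustrative assumptions:

```python
import numpy as np

def sp_kd_loss(feats_a, feats_b):
    """Similarity-preserving KD loss: penalize differences between the
    row-normalized pairwise similarity matrices of two feature batches.
    feats_a, feats_b: (B, D) arrays of per-utterance encoder features."""
    def sim_matrix(f):
        g = f @ f.T                              # (B, B) pairwise similarities
        norm = np.linalg.norm(g, axis=1, keepdims=True)
        return g / np.maximum(norm, 1e-12)       # row-normalize for scale invariance
    ga, gb = sim_matrix(feats_a), sim_matrix(feats_b)
    b = feats_a.shape[0]
    return float(np.sum((ga - gb) ** 2) / (b * b))  # mean squared Frobenius gap

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
print(sp_kd_loss(x, x))  # identical features -> 0.0
```

Because only the relative similarity structure within a batch is matched, the two encoders are free to differ in feature scale and dimension layout while still being pulled toward consistent utterance-level relationships.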