Automatic Speech Recognition (ASR) has seen remarkable progress, with models such as OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming (online, real-time) transcription because of limitations in their architecture and training methodology. We propose a method for turning a transformer encoder-decoder model into a low-latency streaming model. The encoder is made causal so that it processes audio incrementally, while the decoder conditions on partial encoder states to generate tokens aligned with the temporal context available so far. This requires explicit synchronization between encoded input frames and token emissions. Because tokens are produced only after sufficient acoustic evidence has been observed, an inherent latency arises, which necessitates fine-tuning the encoder-decoder alignment mechanism. We propose an updated inference mechanism that uses the fine-tuned causal encoder and decoder to support greedy and beam-search decoding, and we show that it is locally optimal. Experiments with low-latency chunk sizes (under 300 ms) show that our fine-tuned model outperforms existing non-fine-tuned streaming approaches in most cases, while having lower complexity. We release our training and inference code, along with the fine-tuned models, to support further research and development in streaming ASR.
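To illustrate what a causal encoder over audio chunks can look like, the following minimal sketch builds a chunk-wise causal attention mask, in which each frame may attend only to frames in its own chunk or in earlier chunks. The abstract does not specify the masking scheme, so the chunk-wise form, the function name `chunk_causal_mask`, and the frame counts used here are assumptions for illustration and not the authors' implementation.

```python
def chunk_causal_mask(num_frames, chunk_frames):
    """Return a boolean matrix where mask[i][j] is True when encoder
    frame i may attend to frame j.

    Assumed scheme (not taken from the paper): frames are grouped into
    fixed-size chunks, and a frame sees its own chunk plus all earlier
    chunks, but nothing from future chunks.
    """
    if num_frames < 0 or chunk_frames <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_frames must be > 0")
    # Index of the chunk each frame belongs to.
    chunk_ids = [i // chunk_frames for i in range(num_frames)]
    return [
        [chunk_ids[j] <= chunk_ids[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Six frames in chunks of two: rows are queries, columns are keys.
    for row in chunk_causal_mask(num_frames=6, chunk_frames=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed scheme, the encoder states for a chunk never depend on later audio, so they can be computed once when the chunk arrives and reused as the partial encoder states on which the decoder conditions.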