Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment together with the previous token is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy in which a non-streaming LLM-ASR model and our streaming model share parameters. Experiments on the AISHELL-1 and AISHELL-2 Mandarin benchmarks demonstrate that our method consistently outperforms recent streaming ASR baselines, achieving character error rates of 5.1% and 5.5%, respectively. The latency optimization results in a 62.5% reduction in average token generation delay with negligible impact on recognition accuracy.
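The inference procedure described above can be sketched as a simple buffer-and-trigger loop. This is a minimal illustration only: the trigger stub, function names, and the frame-count threshold are hypothetical stand-ins, not the paper's actual MoChA policy network or LLM.

```python
# Minimal sketch of the streaming inference loop: buffer audio frames
# until a read signal fires, then feed the buffered segment plus the
# previous token to the LLM for the next token prediction.
# All components below are illustrative stubs, not the paper's models.

def read_trigger(buffer, threshold=4):
    # Stub for the MoChA read/write policy: fire a read once the
    # buffer holds `threshold` frames (a real policy is learned).
    return len(buffer) >= threshold

def llm_next_token(segment, prev_token):
    # Stub decoder-only LLM step: returns a placeholder token
    # conditioned on the segment and the previous token.
    return f"tok({prev_token},{len(segment)})"

def stream_decode(frames, threshold=4):
    buffer, tokens, prev = [], [], "<s>"
    for frame in frames:
        buffer.append(frame)          # write: accumulate the stream
        if read_trigger(buffer, threshold):
            prev = llm_next_token(buffer, prev)  # read: predict token
            tokens.append(prev)
            buffer = []               # segment consumed; start fresh
    return tokens
```

With 10 frames and a threshold of 4, the loop emits two tokens and leaves the last two frames buffered, mirroring how decoding waits for the next read signal rather than forcing an output per frame.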