This paper addresses streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has been shown to match or surpass Transformers on various tasks while benefiting from linear complexity in sequence length. We explore the efficiency of a Mamba encoder for streaming ASR and propose an associated lookahead mechanism that leverages a controllable amount of future information. Additionally, we implement a streaming-style unimodal aggregation (UMA) method, which automatically detects token activity and triggers token output in a streaming manner, while aggregating feature frames to learn better token representations. Based on UMA, we propose an early termination (ET) method to further reduce recognition latency. Experiments on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
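To make the aggregation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that token activity is signalled by one positive scalar weight per frame, that the weights rise and then fall within each token, and that a valley (the weight stops falling and starts rising) therefore marks a token boundary. The valley rule, the function name `stream_uma`, and the toy inputs are illustrative assumptions that the abstract does not specify.

```python
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple


def stream_uma(
    frames: Iterable[Sequence[float]],
    weights: Iterable[float],
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (last_frame_index, aggregated_vector) for each detected segment.

    Hypothetical rule: a segment closes when the weight has been falling
    and then rises again. Weights are assumed to be strictly positive
    (for example, sigmoid outputs), so the normaliser is never zero.
    """
    acc: List[float] = []          # weighted sum of frames in the open segment
    total = 0.0                    # sum of weights in the open segment
    prev: Optional[float] = None   # weight of the previous frame
    falling = False                # has the weight decreased in this segment?
    last = -1                      # index of the most recent frame consumed

    for t, (x, w) in enumerate(zip(frames, weights)):
        if prev is not None and falling and w > prev:
            # Valley detected at frame t - 1: emit the finished segment.
            yield last, [v / total for v in acc]
            acc, total, falling = [], 0.0, False
        elif prev is not None and w < prev:
            falling = True

        if not acc:
            acc = [0.0] * len(x)
        # Accumulate the weighted frame into the open segment.
        acc = [a + w * v for a, v in zip(acc, x)]
        total += w
        prev, last = w, t

    # Flush the final open segment at the end of the stream.
    if total > 0.0:
        yield last, [v / total for v in acc]


if __name__ == "__main__":
    toy_frames = [[float(i)] for i in range(8)]
    toy_weights = [0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 0.8, 0.3]
    for end, vec in stream_uma(toy_frames, toy_weights):
        print(end, vec)
```

In this sketch a segment is emitted only when the first rising frame of the next segment arrives, so each output lags its last frame by one frame.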