Selective state space models (SSMs), represented by Mamba, have demonstrated computational efficiency and promising results in various tasks, including automatic speech recognition (ASR). Mamba has previously been applied to ASR within the attention-based encoder-decoder framework, where the cross-attention mechanism between the encoder and decoder remains. This paper explores the capability of Mamba as a decoder-only architecture for ASR. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing, which performs bidirectional processing on the speech tokens and enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. Combining speech prefixing with the recently proposed Mamba-2 yields performance comparable to Transformer-based models on large datasets.
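The idea behind speech prefixing can be sketched with a toy example. Below, a scalar linear recurrence stands in for the SSM scan (the actual Mamba recurrence is input-selective and vector-valued); `scan` and `speech_prefix_states` are illustrative names, not functions from the paper. The speech prefix is processed both forward and backward and the two passes are combined, so each prefix hidden state carries bidirectional context before the decoder continues causally over the text tokens.

```python
import numpy as np

def scan(x, a=0.9):
    # Toy linear recurrence h_t = a * h_{t-1} + x_t,
    # a simplified stand-in for an SSM scan over a token sequence.
    h = np.zeros(len(x), dtype=float)
    acc = 0.0
    for t, v in enumerate(x):
        acc = a * acc + v
        h[t] = acc
    return h

def speech_prefix_states(speech):
    # Speech prefixing (sketch): run the scan forward and backward
    # over the speech tokens and sum the results, so each position's
    # state reflects both past and future speech context.
    fwd = scan(speech)
    bwd = scan(speech[::-1])[::-1]
    return fwd + bwd

speech = np.array([1.0, 2.0, 3.0])
states = speech_prefix_states(speech)
# First position now also reflects later speech tokens,
# unlike a purely causal scan.
```

In the actual model the text tokens would then be decoded autoregressively on top of these enriched prefix states; this sketch only illustrates why bidirectional processing of the prefix adds context that a causal scan alone cannot provide.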