The Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the computational complexity of the multi-head self-attention mechanism in the Transformer, selective state space models (i.e., Mamba) were proposed as an alternative. Mamba has demonstrated effectiveness in natural language processing and computer vision tasks, but its merits have rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing through two typical speech tasks: speech recognition, which requires both semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The results show the superiority of bidirectional Mamba (BiMamba) over vanilla Mamba for speech processing. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in the Transformer and its derivatives, particularly for the semantic-aware task. The key techniques for transferring Mamba to speech are then summarized in the ablation studies and the discussion section to offer insights for future research.
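As a concrete illustration of the bidirectional design the abstract refers to, the following is a minimal sketch of a BiMamba-style layer: two unidirectional Mamba blocks, one over the sequence as-is and one over the time-reversed sequence, with their outputs fused. The factory argument `make_mamba`, the additive fusion, and the LayerNorm placement are illustrative assumptions here, not necessarily the paper's exact configuration; a real unidirectional block could come from, e.g., the `mamba_ssm` package.

```python
import torch
import torch.nn as nn


class BiMambaLayer(nn.Module):
    """Sketch of a bidirectional Mamba (BiMamba) layer.

    `make_mamba` is assumed to be a factory returning a causal Mamba block
    that maps (batch, time, d_model) -> (batch, time, d_model),
    e.g. lambda d: mamba_ssm.Mamba(d_model=d).
    """

    def __init__(self, d_model: int, make_mamba):
        super().__init__()
        self.fwd = make_mamba(d_model)   # processes the sequence left-to-right
        self.bwd = make_mamba(d_model)   # processes the time-reversed sequence
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        h_fwd = self.fwd(x)
        # Flip along the time axis so the second block sees right-to-left
        # context, then flip back to restore the original ordering.
        h_bwd = torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        # Residual connection around the fused bidirectional features
        # (additive fusion is an illustrative choice).
        return self.norm(x + h_fwd + h_bwd)
```

Used as a drop-in replacement for a self-attention sublayer, such a block would keep the Transformer-style residual structure while replacing the quadratic-cost attention with two linear-time state space scans.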