Transformers have been the most successful architecture for a range of speech modeling tasks, including speech separation. However, self-attention scales quadratically with sequence length, which makes it costly in both computation and memory. Recent models add new layers and modules alongside transformers to improve performance, at the price of extra model complexity. In this work, we replace transformers with Mamba, a selective state space model, for speech separation. We propose dual-path Mamba, which uses selective state spaces to model short-term and long-term dependencies of speech signals in both the forward and backward directions. Our experiments on the WSJ0-2mix dataset show that dual-path Mamba models of comparatively smaller size outperform the state-of-the-art RNN model DPRNN, CNN model WaveSplit, and transformer model Sepformer. Code: https://github.com/xi-j/Mamba-TasNet
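To make the dual-path idea concrete, the following is a minimal NumPy sketch of how short-term (intra-chunk) and long-term (inter-chunk) dependencies can each be modeled in the forward and backward directions. It is not the implementation from the linked repository: the fixed-decay recurrence `ssm_scan` is a hypothetical stand-in for a selective state space layer, the names `bidirectional` and `dual_path_block` are illustrative, and the chunks are assumed to be non-overlapping for simplicity.

```python
import numpy as np

def ssm_scan(x, decay):
    """Toy linear recurrence h[t] = decay * h[t-1] + x[t] along axis 0.

    A stand-in for a selective state space layer (assumption): real Mamba
    makes its parameters input-dependent, which this sketch does not.
    """
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional(x, decay):
    """Sum of a forward scan and a backward scan along axis 0."""
    forward = ssm_scan(x, decay)
    backward = ssm_scan(x[::-1], decay)[::-1]
    return forward + backward

def dual_path_block(x, chunk_len, decay=0.5):
    """Apply intra-chunk, then inter-chunk bidirectional scans.

    x: array of shape (T, D), with T divisible by chunk_len
       (non-overlapping chunks, a simplifying assumption).
    """
    x = np.asarray(x, dtype=float)
    T, D = x.shape
    assert T % chunk_len == 0, "T must be a multiple of chunk_len"
    chunks = x.reshape(T // chunk_len, chunk_len, D)  # (S, K, D)

    # Short-term path: scan over the K positions inside each chunk.
    intra = bidirectional(chunks.transpose(1, 0, 2), decay).transpose(1, 0, 2)
    intra = chunks + intra  # residual connection

    # Long-term path: scan over the S chunks at each within-chunk position.
    inter = intra + bidirectional(intra, decay)
    return inter.reshape(T, D)

if __name__ == "__main__":
    signal = np.random.default_rng(0).standard_normal((16, 4))
    print(dual_path_block(signal, chunk_len=4).shape)
```

Each scan only covers a sequence of length K or S rather than the full length T, which is what keeps the per-path cost low while the inter-chunk path still carries information across the whole signal.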