Mamba, a selective state-space model (SSM), has emerged as an efficient alternative to Transformers for speech modeling, enabling long-sequence processing with linear complexity. While effective in speech separation, existing approaches, whether in the time or time-frequency domain, typically decompose the input along a single dimension into short one-dimensional sequences before processing them with Mamba, which restricts Mamba to local 1D modeling and limits its ability to capture global dependencies across the 2D spectrogram. In this work, we propose an efficient omni-directional attention (OA) mechanism built upon unidirectional Mamba, which models global dependencies from ten different directions on the spectrogram. We integrate the proposed mechanism into two baseline separation models and evaluate them on three public datasets. Experimental results show that our approach consistently achieves significant performance gains over the baselines while preserving linear complexity, and outperforms existing state-of-the-art (SOTA) systems.
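To make the multi-directional scanning idea concrete, below is a minimal PyTorch sketch of directional Mamba-style scanning over spectrogram features. It is illustrative only: the paper describes ten directions, while this sketch shows four axis-aligned ones; the `DirectionalScan` module name, the use of separate per-direction scanners, and the linear fusion are assumptions, and an `nn.GRU` stands in for a unidirectional Mamba block exposing the same `(batch, length, dim)` interface (e.g., `mamba_ssm.Mamba`).

```python
import torch
import torch.nn as nn


class DirectionalScan(nn.Module):
    """Sketch of multi-directional scanning over spectrogram features.

    Four axis-aligned directions are shown; the paper's omni-directional
    attention uses ten. nn.GRU stands in for a unidirectional Mamba block
    with the same (batch, length, dim) -> (batch, length, dim) interface.
    """

    def __init__(self, dim: int):
        super().__init__()
        # One unidirectional scanner per direction (stand-in for Mamba).
        self.scanners = nn.ModuleList(
            nn.GRU(dim, dim, batch_first=True) for _ in range(4)
        )
        # Fuse the directional views back to the original feature width.
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, F, D) -- batch, time frames, frequency bins, channels.
        B, T, F, D = x.shape

        # Directions 1-2: scan along time (forward / backward), treating
        # each frequency bin as an independent 1D sequence.
        seq_t = x.permute(0, 2, 1, 3).reshape(B * F, T, D)
        fwd_t = self.scanners[0](seq_t)[0]
        bwd_t = torch.flip(self.scanners[1](torch.flip(seq_t, dims=[1]))[0], dims=[1])

        # Directions 3-4: scan along frequency (forward / backward), treating
        # each time frame as an independent 1D sequence.
        seq_f = x.reshape(B * T, F, D)
        fwd_f = self.scanners[2](seq_f)[0]
        bwd_f = torch.flip(self.scanners[3](torch.flip(seq_f, dims=[1]))[0], dims=[1])

        outs = [
            fwd_t.reshape(B, F, T, D).permute(0, 2, 1, 3),
            bwd_t.reshape(B, F, T, D).permute(0, 2, 1, 3),
            fwd_f.reshape(B, T, F, D),
            bwd_f.reshape(B, T, F, D),
        ]
        return self.proj(torch.cat(outs, dim=-1))


# Example: a batch of 2 spectrograms, 50 frames x 64 bins, 32 channels.
x = torch.randn(2, 50, 64, 32)
y = DirectionalScan(32)(x)  # -> (2, 50, 64, 32)
```

Because each directional pass is a single unidirectional scan, the cost stays linear in the number of time-frequency bins, which is consistent with the linear-complexity claim above; diagonal scan orders would follow the same pattern with a different flattening of the 2D grid.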