MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

Several studies have shown that recent sequence models such as Mamba and xLSTM match or outperform the state of the art in single-channel speech enhancement and audio representation learning. However, prior research has demonstrated that sequence models such as LSTM and Mamba tend to overfit to the training set. Previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance has been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba with shared time and frequency multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VB-DemandEx, a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, MambAttention significantly outperforms existing state-of-the-art discriminative LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 without reverberation and EARS-WHAM_v2. MambAttention also matches or outperforms generative diffusion models in generalization performance while remaining competitive with language model baselines. Ablation studies highlight the importance of sharing weights between the time and frequency multi-head attention modules for generalization performance. Finally, we explore integrating the shared time and frequency multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. Even so, MambAttention remains superior for cross-corpus generalization across all reported evaluation metrics.
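The weight sharing between the time and frequency attention modules can be illustrated with a minimal sketch. The abstract does not specify the block layout, so everything below is an assumption made for illustration: the helper names (`linear`, `multi_head_attention`, `shared_tf_attention`, `init_params`), the time-then-frequency ordering, and the omission of Mamba layers, normalization, residual connections, and biases. The only point it demonstrates is that a single set of attention weights is reused along both axes of a time-frequency feature map.

```python
import math
import random


def linear(vec, weight):
    """Project a length-C vector with a C x C weight matrix (no bias)."""
    return [sum(v * weight[i][j] for i, v in enumerate(vec))
            for j in range(len(weight[0]))]


def multi_head_attention(seq, params, num_heads):
    """Scaled dot-product self-attention over a sequence of C-dim vectors."""
    channels = len(seq[0])
    head_dim = channels // num_heads
    q = [linear(x, params["wq"]) for x in seq]
    k = [linear(x, params["wk"]) for x in seq]
    v = [linear(x, params["wv"]) for x in seq]
    mixed = [[0.0] * channels for _ in seq]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        for i, qi in enumerate(q):
            scores = [sum(a * b for a, b in zip(qi[lo:hi], kj[lo:hi]))
                      / math.sqrt(head_dim) for kj in k]
            peak = max(scores)  # subtract the max for a stable softmax
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            for c in range(lo, hi):
                mixed[i][c] = sum(w * vj[c] for w, vj in zip(weights, v)) / total
    return [linear(m, params["wo"]) for m in mixed]


def shared_tf_attention(x, params, num_heads):
    """x[t][f] is a C-dim feature vector; the same params serve both axes."""
    n_time, n_freq = len(x), len(x[0])
    # Time attention: each frequency bin is treated as a sequence over frames.
    cols = [multi_head_attention([x[t][f] for t in range(n_time)],
                                 params, num_heads)
            for f in range(n_freq)]
    x = [[cols[f][t] for f in range(n_freq)] for t in range(n_time)]
    # Frequency attention: each frame is a sequence over bins, reusing params.
    return [multi_head_attention(x[t], params, num_heads)
            for t in range(n_time)]


def init_params(channels, seed=0):
    """Hypothetical initializer: one weight set shared by both attention passes."""
    rng = random.Random(seed)
    return {name: [[rng.gauss(0.0, 0.1) for _ in range(channels)]
                   for _ in range(channels)]
            for name in ("wq", "wk", "wv", "wo")}


# Toy feature map with 5 frames, 3 frequency bins, and 4 channels.
rng = random.Random(42)
features = [[[rng.random() for _ in range(4)] for _ in range(3)]
            for _ in range(5)]
shared = init_params(channels=4)
enhanced = shared_tf_attention(features, shared, num_heads=2)
print(len(enhanced), len(enhanced[0]), len(enhanced[0][0]))
```

Under these assumptions, an unshared variant would simply pass two separately initialized parameter sets to the two passes, which doubles the attention parameter count; the ablation mentioned above compares this kind of design choice.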