Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms achieve superior cross-corpus generalization. At the same time, integrating Mamba into a U-Net structure has yielded state-of-the-art enhancement performance while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layer-wise attention sharing across corresponding time and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using fewer than half the model parameters and a fraction of the FLOPs.
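To make the attention-sharing idea concrete, below is a minimal PyTorch sketch of a U-Net in which encoder and decoder stages operating at the same resolution reuse a single attention module. This is an illustrative assumption, not the authors' implementation: the class name RWSAUNetSketch, the channel widths, and the 1-D convolutional encoder/decoder are hypothetical, the sketch attends over a single (time) axis rather than paired time and frequency axes, and the Mamba blocks of the actual model are omitted.

```python
# Minimal sketch (assumption, not the paper's code) of resolution-wise
# shared attention: encoder and decoder stages at the same resolution
# level reuse one nn.MultiheadAttention module.
import torch
import torch.nn as nn

class RWSAUNetSketch(nn.Module):
    def __init__(self, channels=(32, 64, 128), num_heads=4):
        super().__init__()
        # One attention module per resolution level, shared between the
        # encoder and decoder paths at that level.
        self.shared_attn = nn.ModuleList(
            nn.MultiheadAttention(c, num_heads, batch_first=True) for c in channels
        )
        self.down = nn.ModuleList(
            nn.Conv1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.up = nn.ModuleList(
            nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
            for c_in, c_out in zip(channels[::-1][:-1], channels[::-1][1:])
        )

    def _attend(self, x, level):
        # x: (batch, channels, time); self-attention over the time axis
        # with a residual connection.
        seq = x.transpose(1, 2)                        # (batch, time, channels)
        out, _ = self.shared_attn[level](seq, seq, seq)
        return (seq + out).transpose(1, 2)             # back to (batch, channels, time)

    def forward(self, x):
        # Input length must be divisible by 4 (two stride-2 downsamplings).
        skips = []
        for level, down in enumerate(self.down):
            x = self._attend(x, level)                 # encoder pass, level i
            skips.append(x)
            x = down(x)
        x = self._attend(x, len(self.shared_attn) - 1) # bottleneck level
        for i, up in enumerate(self.up):
            level = len(self.shared_attn) - 2 - i
            x = up(x)
            x = self._attend(x + skips[level], level)  # decoder reuses level-i attention
        return x
```

As a usage check, RWSAUNetSketch()(torch.randn(2, 32, 64)) returns a tensor of the same shape; each level's attention weights are applied twice per forward pass, once on the encoder path and once on the decoder path, which is how sharing keeps the parameter count below that of a U-Net with independent attention at every stage.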