In recent speech enhancement (SE) research, Transformers and their variants have emerged as the predominant methodology. However, the quadratic complexity of the self-attention mechanism limits their practical deployment, particularly on long sequences. Mamba, a recent state-space model (SSM), has been widely adopted in natural language processing and computer vision owing to its strong long-sequence modeling capability and relatively low computational complexity. In this work, we introduce Mamba-SEUNet, an architecture that integrates Mamba with U-Net for SE tasks. By leveraging bidirectional Mamba to model the forward and backward dependencies of speech signals at multiple resolutions, and incorporating skip connections to capture multi-scale information, our approach achieves state-of-the-art (SOTA) performance. Experimental results on the VCTK+DEMAND dataset show that Mamba-SEUNet attains a PESQ score of 3.59 while maintaining low computational complexity. When combined with the Perceptual Contrast Stretching technique, Mamba-SEUNet further improves the PESQ score to 3.73.
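The bidirectional modeling idea can be illustrated with a minimal sketch: run one linear state-space recurrence over the sequence and a second over the reversed sequence, then combine the two outputs. This is a toy, time-invariant stand-in for the selective (input-dependent) Mamba block, not the paper's actual implementation; the function names and the additive fusion are assumptions for illustration.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """One direction of a linear SSM recurrence:
    h[t] = A @ h[t-1] + B @ x[t],  y[t] = C @ h[t].
    x: (T, d_in) sequence; returns y: (T, d_out)."""
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

def bidirectional_ssm(x, A, B, C):
    """Forward scan plus a scan over the reversed sequence
    (flipped back to original order), summed -- a simplified
    analogue of the bidirectional Mamba block."""
    fwd = ssm_scan(x, A, B, C)
    bwd = ssm_scan(x[::-1], A, B, C)[::-1]
    return fwd + bwd
```

In the real model the state matrices are input-dependent (selective) and the block sits inside a U-Net, with skip connections passing each encoder resolution to the matching decoder stage.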