Speech enhancement aims to improve speech quality and intelligibility, especially in noisy environments where background noise degrades the speech signal. Deep learning methods have achieved great success in speech enhancement, notably the convolutional recurrent neural network (CRN) and its variants. However, CRN typically models frequency with consecutive downsampling and upsampling convolutions, which destroys the inherent structure of the signal along the frequency axis. In addition, convolutional layers lack temporal modeling ability. To address these issues, we propose a module that combines a State space model and Inplace Convolution (SIC) and use it to replace the conventional convolution in CRN; we call the resulting network SICRN. Specifically, a dual-path multidimensional state space model captures global frequency dependencies and long-term temporal dependencies, while a 2D in-place convolution captures local structure without any downsampling or upsampling. Systematic evaluations on the public INTERSPEECH 2020 DNS Challenge dataset demonstrate SICRN's efficacy. Compared with strong baselines, SICRN achieves performance close to the state of the art while offering advantages in parameter count, computational cost, and algorithmic delay. These results suggest that SICRN is a promising approach to speech enhancement.
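The two ideas named in the abstract can be illustrated with a toy sketch, which is not the SICRN implementation. It assumes a scalar state-space recurrence with fixed, hand-picked coefficients (the actual model is multidimensional and learned) and a 1-D frequency convolution standing in for the 2D in-place convolution. All function names and values below are hypothetical.

```python
# Illustrative sketch only; every name and coefficient here is an assumption.

def conv1d_freq(x, kernel, stride=1):
    """Convolve along frequency with zero 'same' padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    out = []
    for start in range(0, len(x), stride):
        window = padded[start:start + len(kernel)]
        out.append(sum(w * v for w, v in zip(kernel, window)))
    return out


def ssm_scan(u, a=0.9, b=1.0, c=1.0):
    """Scalar linear state-space recurrence: h[k] = a*h[k-1] + b*u[k], y[k] = c*h[k]."""
    h = 0.0
    ys = []
    for value in u:
        h = a * h + b * value
        ys.append(c * h)
    return ys


def dual_path(spec, a=0.9):
    """Scan along frequency within each frame, then along time for each bin.

    `spec` is a list of frames; each frame is a list of frequency-bin values.
    """
    freq_pass = [ssm_scan(frame, a) for frame in spec]
    n_bins = len(spec[0])
    # Time pass: for each frequency bin, scan across frames.
    columns = [ssm_scan([frame[f] for frame in freq_pass], a) for f in range(n_bins)]
    # Transpose back to frames x bins so the output shape matches the input.
    return [[columns[f][t] for f in range(n_bins)] for t in range(len(spec))]


frame = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.25, 0.5, 0.25]
inplace = conv1d_freq(frame, kernel, stride=1)  # in-place: keeps all 8 bins
strided = conv1d_freq(frame, kernel, stride=2)  # downsampling: keeps 4 bins
print(len(frame), len(inplace), len(strided))   # prints: 8 8 4
```

With stride 1 the output keeps one value per input frequency bin, whereas stride 2 halves the resolution and would need a matching upsampling step to restore it. The two-pass scan shows how a recurrence can carry information across all bins and all frames while leaving the time-frequency shape unchanged.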