Current speech enhancement (SE) research has largely neglected channel attention and spatial attention, and networks built on an encoder-decoder architecture have not adequately considered how to provide efficient inputs to the intermediate enhancement layer. To address these issues, this paper proposes DPCFCS-Net, a time-frequency (T-F) domain SE network that incorporates improved densely connected blocks, dual-path modules, convolution-augmented transformers (conformers), channel attention, and spatial attention. Compared with previous models, the proposed model has a more efficient encoder-decoder and can learn comprehensive features. Experimental results on the VCTK+DEMAND dataset show that the method outperforms existing techniques in SE performance. Furthermore, the improved densely connected block and the two-dimensional attention module developed in this work are highly adaptable and can be easily integrated into existing networks.
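To make the two attention types named in the abstract concrete, the following is a minimal NumPy sketch of channel attention followed by spatial attention on a T-F feature map of shape (channels, time frames, frequency bins). It is not the DPCFCS-Net module: the abstract does not specify that design, so the sketch follows the widely used squeeze-and-gate pattern instead. The function names, the bottleneck size, the random weights, and the 1x1 mixing used in place of a learned convolution are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Rescale each channel of x (shape C x T x F) by a gate in (0, 1).

    w1 (R x C) and w2 (C x R) form a shared bottleneck MLP; both are
    hypothetical parameters, not taken from the paper.
    """
    avg = x.mean(axis=(1, 2))  # (C,) average over time and frequency
    mx = x.max(axis=(1, 2))    # (C,) maximum over time and frequency

    def mlp(v):
        # Two linear layers with a ReLU in between.
        return w2 @ np.maximum(w1 @ v, 0.0)

    gate = sigmoid(mlp(avg) + mlp(mx))  # (C,)
    return x * gate[:, None, None]      # broadcast over T and F


def spatial_attention(x, w, b=0.0):
    """Rescale each T-F bin of x (shape C x T x F) by a gate in (0, 1).

    Assumption: the two pooled maps are mixed with scalar weights w[0] and
    w[1] (a 1x1 mixing) instead of a larger learned convolution kernel.
    """
    avg = x.mean(axis=0)  # (T, F) average over channels
    mx = x.max(axis=0)    # (T, F) maximum over channels
    gate = sigmoid(w[0] * avg + w[1] * mx + b)  # (T, F)
    return x * gate[None, :, :]                 # broadcast over channels


# Toy input: 8 channels, 10 time frames, 16 frequency bins (arbitrary sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10, 16))
w1 = rng.standard_normal((2, 8))  # bottleneck of size 2 (assumed)
w2 = rng.standard_normal((8, 2))
y = spatial_attention(channel_attention(x, w1, w2), w=np.array([0.5, 0.5]))
```

Because both gates lie strictly between 0 and 1, the output keeps the input shape and never exceeds the input in magnitude. In a trained network the gate parameters would be learned, so that uninformative channels and T-F regions are attenuated more strongly than informative ones.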