Convolutional neural networks (CNNs) and Transformers have been highly successful in multimedia applications. However, combining the two architectures effectively for speech enhancement remains underexplored. This paper unifies them in a Parallel Conformer for speech enhancement. In particular, the CNN is exploited to capture local formant patterns, and the self-attention (SA) of the Transformer is exploited to represent global structure. To address the small receptive field of CNNs and the high computational complexity of SA, we design a multi-branch dilated convolution (MBDC) module and a self-channel-time-frequency attention (Self-CTFA) module. MBDC contains three convolutional layers with different dilation rates, so that features are processed from local to non-local scales. Experimental results show that our method outperforms state-of-the-art methods on most evaluation metrics while using the fewest model parameters.
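To make the idea behind MBDC concrete, the following is a minimal sketch of a multi-branch dilated convolution in one dimension. It is not the authors' implementation: the kernel size of 3, the dilation rates (1, 2, 4), the parallel arrangement of the branches, and the fusion of branches by summation are all assumptions chosen for illustration, and the function names are hypothetical. The abstract states only that MBDC has three convolutional layers with different dilation rates.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1-D dilated convolution (cross-correlation).

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = (receptive_field(k, dilation) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def multi_branch_dilated_conv(signal, kernels, dilations=(1, 2, 4)):
    """Run one branch per dilation rate and fuse by summation.

    The dilation rates and the summation are illustrative assumptions,
    not values taken from the paper.
    """
    branches = [
        dilated_conv1d(signal, kernel, dilation)
        for kernel, dilation in zip(kernels, dilations)
    ]
    return [sum(values) for values in zip(*branches)]


if __name__ == "__main__":
    # A unit impulse shows which neighbours each branch can reach.
    impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    kernels = [[1.0, 1.0, 1.0]] * 3
    print(multi_branch_dilated_conv(impulse, kernels))
    print([receptive_field(3, d) for d in (1, 2, 4)])
```

With these assumed settings, the three branches reach offsets of ±1, ±2, and ±4 samples around each position, so the fused output mixes local and increasingly non-local context without enlarging the kernel or adding parameters per branch.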