We present ARFDCN, an efficient neural network for speech separation that combines dilated convolutions, multi-scale fusion (MSF), and channel attention to address both the limited receptive field of convolution-based networks and the high computational cost of transformer-based networks. The proposed network follows an encoder-decoder architecture. Dilated convolutions with gradually increasing dilation rates capture local and global features, which are fused at adjacent stages so that the model learns rich feature representations. Channel attention modules additionally learn per-channel weights that emphasize the more important features, improving the model's expressive power and robustness. Experimental results indicate that the model achieves a good balance between performance and computational efficiency, making it a promising alternative to current mainstream models for practical applications.
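To illustrate why gradually increasing dilation helps with the limited receptive field of convolutional networks, the following minimal sketch compares the receptive field of a stack of ordinary convolutions with that of a dilated stack. The kernel size of 3, the four-layer depth, and the doubling dilation schedule are assumptions chosen for illustration and are not taken from ARFDCN; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Illustrative settings only (assumed, not the paper's configuration).
KERNEL_SIZE = 3
plain = receptive_field(KERNEL_SIZE, [1, 1, 1, 1])    # no dilation
dilated = receptive_field(KERNEL_SIZE, [1, 2, 4, 8])  # doubling dilation

print(plain, dilated)  # prints "9 31"
```

With the same number of layers and parameters per layer, the dilated stack covers 31 samples instead of 9, which is the kind of context growth the abstract attributes to increasing dilation rates.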