The Dual-Path Convolution Recurrent Network (DPCRN) was proposed to effectively exploit time-frequency domain information. By combining the DPRNN module with a Convolution Recurrent Network (CRN), DPCRN achieved promising performance in speech separation with a limited model size. In this paper, we explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement. We use self-attention modules to exploit long-term information: intra-chunk self-attention models the spectral pattern, and inter-chunk self-attention models the dependence between consecutive frames. Compared to DPRNN, axial self-attention greatly reduces memory and computation requirements, which makes it more suitable for long sequences of speech signals. In addition, we propose a joint training method that combines a multi-resolution STFT loss with a WavLM loss computed using a pre-trained WavLM network. Experiments show that, with only 0.23M parameters, the proposed model achieves better performance than DPCRN.
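To make the axial (intra-chunk, then inter-chunk) attention pattern concrete, the following is a minimal NumPy sketch, not the MNTFA implementation. It assumes a feature map of shape `(T, F, C)` (frames, frequency bins, channels), single-head attention with identity projections instead of learned query/key/value weights, and a residual connection around each attention step; the function names `self_attention`, `axial_tf_attention`, and `attention_entries` are hypothetical. The helper `attention_entries` counts attention-score entries for the axial scheme and, purely for illustration, for full attention over all `T*F` bins; it does not reproduce the DPRNN comparison made in the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., L, C). Queries, keys, and values are all x itself
    (identity projections, an assumption made to keep the sketch short).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ x               # (..., L, C)


def axial_tf_attention(x):
    """Apply attention along frequency, then along time.

    x has shape (T, F, C). The residual connections are an assumption.
    """
    # Intra-chunk: within each frame, attend across the F frequency bins.
    x = x + self_attention(x)            # batch = T, length = F
    # Inter-chunk: for each frequency bin, attend across the T frames.
    xt = np.swapaxes(x, 0, 1)            # (F, T, C)
    xt = xt + self_attention(xt)         # batch = F, length = T
    return np.swapaxes(xt, 0, 1)         # back to (T, F, C)


def attention_entries(T, F):
    """Count attention-score entries: (axial, full attention over T*F bins)."""
    axial = T * F * F + F * T * T
    full = (T * F) ** 2
    return axial, full


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((100, 64, 8))
    out = axial_tf_attention(feats)
    print(out.shape)
    print(attention_entries(100, 64))
```

Under these assumptions, with `T = 100` frames and `F = 64` bins the axial scheme stores 1,049,600 attention scores, while full attention over all 6,400 time-frequency bins would store 40,960,000. The gap widens as the sequence gets longer, which is consistent with the abstract's point about long speech signals.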