Convolutional recurrent networks (CRNs), which combine a convolutional encoder-decoder (CED) structure with a recurrent structure, have achieved promising performance in monaural speech enhancement. However, feature representation across the frequency context is highly constrained by the limited receptive fields of the convolutions in the CED. In this paper, we propose a convolutional recurrent encoder-decoder (CRED) structure to strengthen feature representation along the frequency axis. The CRED applies frequency recurrence to the 3D convolutional feature maps along the frequency axis after each convolution; it can therefore capture long-range frequency correlations and enhance the feature representations of speech inputs. The proposed frequency recurrence is realized efficiently using a feedforward sequential memory network (FSMN). In addition to the CRED, we insert two stacked FSMN layers between the encoder and the decoder to further model temporal dynamics. We name the proposed framework the Frequency Recurrent CRN (FRCRN). We design FRCRN to predict the complex ideal ratio mask (cIRM) in the complex-valued domain and optimize it using both time-frequency-domain and time-domain losses. The proposed approach achieved state-of-the-art performance on wideband benchmark datasets and took 2nd place in the real-time fullband track, in terms of Mean Opinion Score (MOS) and Word Accuracy (WAcc), of the ICASSP 2022 Deep Noise Suppression (DNS) challenge (https://github.com/modelscope/ClearerVoice-Studio).
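To make the idea of frequency recurrence concrete, the following is a minimal sketch of a generic scalar FSMN-style memory block applied along the frequency axis of a 3D feature map. It is an illustration under stated assumptions, not the FRCRN layer itself: the function names, the tap values, the zero padding at the band edges, and the nested-list layout (channels, frames, frequency bins) are all hypothetical choices made for this example.

```python
def fsmn_memory_along_frequency(frame, look_back, look_ahead):
    """Apply a scalar FSMN-style memory block across frequency bins.

    frame:      feature values of one channel and one time frame, one per bin.
    look_back:  taps a_0..a_N1, weighting bins f, f-1, ..., f-N1.
    look_ahead: taps c_1..c_N2, weighting bins f+1, ..., f+N2.
    Bins outside the frame are treated as zero (an assumption of this sketch).
    """
    num_bins = len(frame)
    out = []
    for f in range(num_bins):
        acc = 0.0
        # Weighted sum over the current bin and lower-frequency bins.
        for i, a in enumerate(look_back):
            if f - i >= 0:
                acc += a * frame[f - i]
        # Weighted sum over higher-frequency bins.
        for j, c in enumerate(look_ahead, start=1):
            if f + j < num_bins:
                acc += c * frame[f + j]
        out.append(acc)
    return out


def apply_to_feature_map(feature_map, look_back, look_ahead):
    """Run the memory block over every (channel, frame) slice.

    feature_map is a nested list indexed as [channel][frame][bin]
    (a hypothetical layout chosen for this example).
    """
    return [
        [fsmn_memory_along_frequency(frame, look_back, look_ahead)
         for frame in channel]
        for channel in feature_map
    ]


if __name__ == "__main__":
    # One channel, two time frames, four frequency bins; arbitrary tap values.
    fmap = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0]]]
    print(apply_to_feature_map(fmap, look_back=[1.0, 0.5], look_ahead=[0.25]))
```

Because each output bin is a learned weighted sum over a window of neighbouring bins rather than the result of a sequential hidden-state update, a block of this kind can be computed in a feedforward manner, which is the usual motivation for choosing an FSMN over a conventional recurrent layer.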
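As a brief illustration of the cIRM target mentioned in the abstract, the sketch below computes an ideal complex ratio mask from clean and noisy STFT coefficients and applies it by complex multiplication. It uses the standard definition of the mask as the complex ratio of the clean to the noisy coefficient; the function names and the small stabilizing constant `eps` are assumptions of this example, and any mask compression or network prediction step used in FRCRN is deliberately left out.

```python
def ideal_complex_ratio_mask(clean, noisy, eps=1e-8):
    """Compute M = S / Y per time-frequency bin.

    clean, noisy: sequences of complex STFT coefficients (S and Y).
    eps: small constant that avoids division by zero (an assumption here).
    """
    return [s * y.conjugate() / (abs(y) ** 2 + eps)
            for s, y in zip(clean, noisy)]


def apply_complex_mask(mask, noisy):
    """Estimate the clean spectrum by complex multiplication M * Y."""
    return [m * y for m, y in zip(mask, noisy)]


if __name__ == "__main__":
    # Toy coefficients for two time-frequency bins.
    noisy = [1 + 1j, 2 - 1j]
    clean = [0.5 + 0.5j, 1 + 0j]
    mask = ideal_complex_ratio_mask(clean, noisy)
    print(apply_complex_mask(mask, noisy))
```

Unlike a real-valued magnitude mask, a complex mask scales and rotates each coefficient, so it can correct the phase of the noisy spectrum as well as its magnitude.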