Recent studies in neural network-based monaural speech separation (SS) have achieved remarkable success thanks to the growing ability to model long sequences. However, their performance degrades significantly under realistic noisy conditions, because background noise can be mistaken for a speaker's speech and thus interfere with the separated sources. To alleviate this problem, we propose a novel network that unifies speech enhancement and separation with gradient modulation to improve noise robustness. Specifically, we first build a unified network by combining speech enhancement (SE) and separation modules, optimized with multi-task learning, where SE is supervised by the parallel clean mixture to reduce noise for the downstream speech separation. Furthermore, to avoid suppressing valid speaker information when reducing noise, we propose a gradient modulation (GM) strategy that harmonizes the SE and SS tasks from an optimization view. Experimental results show that our approach achieves state-of-the-art performance on the large-scale Libri2Mix-noisy and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB, respectively. Our code is available on GitHub.
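The abstract does not state how the gradient modulation is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a projection-style rule: when the SE gradient conflicts with the SS gradient (negative inner product), the conflicting component of the SE gradient is removed before the two are summed. The function names, the `se_weight` parameter, and the flat-list gradient representation are all hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def modulate_se_gradient(g_se: Sequence[float], g_ss: Sequence[float]) -> List[float]:
    """Assumed rule (not taken from the paper): if the SE gradient opposes
    the SS gradient, project it onto the plane orthogonal to g_ss."""
    inner = dot(g_se, g_ss)
    if inner >= 0.0:
        # No conflict: the SE gradient is left unchanged.
        return list(g_se)
    # inner < 0 implies g_ss is non-zero, so the division is safe.
    scale = inner / dot(g_ss, g_ss)
    return [a - scale * b for a, b in zip(g_se, g_ss)]


def combined_gradient(
    g_se: Sequence[float], g_ss: Sequence[float], se_weight: float = 1.0
) -> List[float]:
    """Multi-task update direction: SS gradient plus the modulated SE gradient.
    se_weight is a hypothetical loss weight for the SE task."""
    g_mod = modulate_se_gradient(g_se, g_ss)
    return [b + se_weight * a for a, b in zip(g_mod, g_ss)]


if __name__ == "__main__":
    g_ss = [0.0, 1.0]   # toy SS gradient
    g_se = [1.0, -1.0]  # toy SE gradient that conflicts with g_ss
    print(modulate_se_gradient(g_se, g_ss))
    print(combined_gradient(g_se, g_ss))
```

In this toy case the modulated SE gradient has zero inner product with the SS gradient, so the SE update can no longer push the shared parameters against the separation objective. In a real training loop the same rule would be applied to the gradients of the shared parameters after separate backward passes for the two losses.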