Speech separation (SS) has recently made remarkable progress, driven by deep learning techniques. However, separating target signals from a noisy mixture remains challenging, because neural models tend to assign background noise to each speaker. In this paper, we propose a noise-aware SS method called NASS, which aims to improve the speech quality of separated signals in noisy conditions. Specifically, NASS treats background noise as an independent speaker and predicts it together with the other speakers in a mask-based manner. We then conduct patch-wise contrastive learning at the feature level to minimize the mutual information between the predicted noise-speaker and the other speakers, which suppresses noise information in the separated signals. Experimental results show that NASS effectively improves the noise robustness of different mask-based separation backbones with a parameter increase of less than 0.1M. Furthermore, SI-SNRi results demonstrate that NASS achieves state-of-the-art performance on the WHAM! dataset.
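The two ideas in the abstract, predicting noise as an extra masked "speaker" and applying a patch-wise contrastive objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `apply_masks`, `cosine`, and `patch_info_nce`, the InfoNCE form of the loss, the temperature value, and the choice of a clean reference patch as the positive are all assumptions made for illustration. Only the use of patches from the predicted noise-speaker as the samples to push away from follows the abstract.

```python
import math


def apply_masks(mixture, masks):
    """Element-wise masking of mixture features.

    `masks` holds one mask per speaker plus one extra mask for the noise,
    which is treated as an additional "speaker".
    """
    return [[m * x for m, x in zip(mask, mixture)] for mask in masks]


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def patch_info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single query patch (hypothetical helper).

    query:     patch taken from a separated speaker's features
    positive:  patch the query should resemble (assumed: clean reference)
    negatives: patches taken from the predicted noise-speaker
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Equivalent to -log(softmax(logits)[0]).
    return log_sum - logits[0]


# Toy usage: two speakers plus one noise "speaker" (values are made up).
mixture = [1.0, 2.0, 3.0, 4.0]
masks = [
    [0.6, 0.1, 0.5, 0.2],  # speaker 1
    [0.3, 0.8, 0.4, 0.1],  # speaker 2
    [0.1, 0.1, 0.1, 0.7],  # noise
]
spk1, spk2, noise = apply_masks(mixture, masks)
loss = patch_info_nce(query=spk1, positive=spk1, negatives=[noise])
```

The loss is small when the speaker patch is close to its positive and far from the noise patches, and large in the opposite case. Minimizing it over many patches pushes speaker features away from noise features, which is the effect the abstract describes.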