Recently, the speech separation (SS) task has achieved remarkable progress, driven by deep learning techniques. However, separating target speech from a noisy mixture remains challenging, because neural models are prone to assigning background noise to each speaker. In this paper, we propose a noise-aware SS (NASS) method that aims to improve the speech quality of separated signals under noisy conditions. Specifically, NASS treats background noise as an independent output and predicts it alongside the speakers in a mask-based manner. We then conduct patch-wise contrastive learning at the feature level to minimize the mutual information between the predicted noise output and the speaker outputs, which suppresses noise information in the separated signals, and vice versa. Experimental results show that NASS achieves competitive results on different datasets and significantly improves the noise robustness of different mask-based SS backbones with a parameter increase of less than 0.1M.
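As an illustration of the patch-wise contrastive idea described in the abstract, the following is a minimal sketch of an InfoNCE-style loss in plain Python. It is not the authors' implementation. The function names (`cosine`, `patch_info_nce`, `patchwise_loss`), the temperature value, the use of cosine similarity, and the choice of a clean reference patch as the positive sample are all assumptions made for illustration; the abstract only states that contrastive learning is used to reduce the mutual information between the predicted noise and the speaker outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def patch_info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single query patch.

    query     : feature patch from a separated speaker output
    positive  : matching patch from a reference speaker feature (assumption)
    negatives : patches taken from the predicted noise output (assumption)
    """
    pos_logit = cosine(query, positive) / temperature
    logits = [pos_logit] + [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and negative logits.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    # Cross-entropy with the positive as the target class.
    return log_denominator - pos_logit


def patchwise_loss(speech_patches, reference_patches, noise_patches,
                   temperature=0.1):
    """Average InfoNCE loss over aligned speech/reference patch pairs."""
    losses = [
        patch_info_nce(q, p, noise_patches, temperature)
        for q, p in zip(speech_patches, reference_patches)
    ]
    return sum(losses) / len(losses)


# Toy 2-D features: a speech patch that matches its reference and is
# orthogonal to the noise patch, versus one that looks like the noise.
clean_like = patchwise_loss([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
noise_like = patchwise_loss([[0.0, 1.0]], [[1.0, 0.0]], [[0.0, 1.0]])
```

In this toy setup, `clean_like` is smaller than `noise_like`: the loss is low when a separated-speech patch resembles its reference and differs from the noise patches, and high when it resembles the noise. Minimizing a loss of this kind is one common way to push noise information out of the speaker features.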