Recently, the speech separation (SS) task has achieved remarkable progress driven by deep learning techniques. However, separating target speech from a noisy mixture remains challenging, as neural models are prone to assigning background noise to each speaker. In this paper, we propose a noise-aware SS (NASS) method, which aims to improve the speech quality of separated signals under noisy conditions. Specifically, NASS views background noise as an additional output and predicts it along with the speakers in a mask-based manner. To denoise effectively, we introduce patch-wise contrastive learning (PCL) between noise and speaker representations from the decoder input and encoder output. The PCL loss aims to minimize the mutual information between the predicted noise and the speakers at the multiple-patch level, thereby suppressing noise information in the separated signals. Experimental results show that NASS improves SI-SNRi or SDRi by 1 to 2 dB over DPRNN and Sepformer on the noisy WHAM! and LibriMix datasets, with a parameter increase of less than 0.1M.
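For readers unfamiliar with the SI-SNRi metric quoted in the abstract, the following is a minimal sketch of the commonly used scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed mixture. It is not taken from the NASS implementation: the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the use of plain Python lists are assumptions made for illustration only.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Illustrative helper (hypothetical name); `eps` guards against division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean of each signal so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]
    # Everything not explained by the target counts as residual error.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(r * r for r in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


if __name__ == "__main__":
    # Synthetic toy signals, not data from the paper.
    target = [math.sin(0.1 * i) for i in range(200)]
    noise = [0.5 * math.cos(0.37 * i) for i in range(200)]
    mixture = [s + v for s, v in zip(target, noise)]
    estimate = [s + 0.1 * v for s, v in zip(target, noise)]
    print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is preferred over plain SNR when a separator's output gain is arbitrary.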