Most deep noise suppression (DNS) models are trained with reference-based losses, which require access to clean speech. However, an additive microphone model is sometimes insufficient for real-world applications. Accordingly, using real training data in supervised DNS training promises to reduce a potential training/inference mismatch. Employing real data for DNS training requires either generative approaches or a reference-free loss that does not need the corresponding clean speech. In this work, we propose to employ an end-to-end non-intrusive deep neural network (DNN), named PESQ-DNN, to estimate perceptual evaluation of speech quality (PESQ) scores of enhanced real data. It provides a reference-free perceptual loss that allows real data to be used during DNS training, with the objective of maximizing the estimated PESQ scores. Furthermore, we use an epoch-wise alternating training protocol: in each epoch, the DNS model is updated on real data, and the PESQ-DNN is then updated on synthetic data. The DNS model trained with the PESQ-DNN on real data outperforms all reference methods that employ only synthetic training data. On synthetic test data, our proposed method outperforms the Interspeech 2021 DNS Challenge baseline by a significant margin of 0.32 PESQ points. On both synthetic and real test data, the proposed method beats the baseline by 0.05 DNSMOS points, even though PESQ-DNN optimizes for a different perceptual metric.
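The epoch-wise alternating protocol described above can be sketched as a plain control loop. This is only an illustration of the schedule, not the authors' implementation: `alternating_training`, `dns_step`, `predictor_step`, and the stub functions are hypothetical names, and the actual losses, optimizers, and network updates are assumed to live inside the two step callables.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def alternating_training(
    num_epochs: int,
    real_batches: Sequence[object],
    synthetic_batches: Sequence[Tuple[object, object]],
    dns_step: Callable[[object], float],
    predictor_step: Callable[[object, object], float],
) -> List[Tuple[float, float]]:
    """Run both phases once per epoch; return the mean loss of each phase."""
    history = []
    for _ in range(num_epochs):
        # Phase 1: update the DNS model on real noisy data (no clean reference).
        # The quality predictor is assumed frozen and supplies the training signal.
        dns_losses = [dns_step(noisy) for noisy in real_batches]
        # Phase 2: update the quality predictor on synthetic pairs, where the
        # clean reference is available. The DNS model is assumed frozen here.
        predictor_losses = [
            predictor_step(noisy, clean) for noisy, clean in synthetic_batches
        ]
        history.append((mean(dns_losses), mean(predictor_losses)))
    return history


# Hypothetical stub steps that only record the call order.
# Real steps would run a forward pass, a loss, and an optimizer update.
calls: List[str] = []


def fake_dns_step(noisy: object) -> float:
    calls.append("dns")
    return 0.0


def fake_predictor_step(noisy: object, clean: object) -> float:
    calls.append("predictor")
    return 0.0


history = alternating_training(
    num_epochs=2,
    real_batches=["real_a", "real_b"],
    synthetic_batches=[("noisy_a", "clean_a")],
    dns_step=fake_dns_step,
    predictor_step=fake_predictor_step,
)
```

Passing the two update steps as callables keeps the schedule separate from any particular framework; in this sketch each epoch finishes all real-data DNS updates before any predictor update begins.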