The lack of clean speech is a practical challenge in developing speech enhancement systems: it creates an inevitable mismatch between their training criterion and evaluation metric. In response, we propose a training and inference strategy that improves on the previously proposed noisy-target training (NyTT) by additionally using enhanced speech as a target. Because homogeneity between in-domain noise and extraneous noise is key to the effectiveness of NyTT, we train various student models by remixing either 1) the teacher model's estimated speech and noise, for enhanced-target training, or 2) raw noisy speech and the teacher model's estimated noise, for noisy-target training. Experimental results show that the proposed method outperforms several baselines, especially with teacher/student inference, in which the predicted clean speech is derived successively through the teacher and the final student models.
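The two remixing options and the teacher/student inference described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the noise estimate is taken as the residual between the noisy input and the teacher's speech estimate, the estimated noise is shuffled across the utterances of a batch before remixing, and `toy_teacher` is a hypothetical stand-in for a trained model. Waveforms are plain Python lists to keep the example dependency-free; a real system would use arrays or tensors.

```python
import random
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]
Model = Callable[[Sequence[float]], Waveform]


def estimate_noise(noisy: Sequence[float], est_speech: Sequence[float]) -> Waveform:
    """Assumption: estimated noise is the residual after removing the speech estimate."""
    return [y - s for y, s in zip(noisy, est_speech)]


def remix_batch(
    noisy_batch: Sequence[Sequence[float]],
    teacher: Model,
    mode: str,
    rng: random.Random,
) -> Tuple[List[Waveform], List[Waveform]]:
    """Build (student input, student target) pairs for one batch.

    mode="enhanced_target": target is the teacher's estimated speech.
    mode="noisy_target":    target is the raw noisy speech.
    In both cases the input is the target plus estimated noise taken
    from a shuffled position in the batch (an assumed remixing rule).
    """
    est_speech = [teacher(y) for y in noisy_batch]
    est_noise = [estimate_noise(y, s) for y, s in zip(noisy_batch, est_speech)]

    # Shuffle which utterance's estimated noise is added to which target.
    order = list(range(len(noisy_batch)))
    rng.shuffle(order)

    if mode == "enhanced_target":
        targets = est_speech
    elif mode == "noisy_target":
        targets = [list(y) for y in noisy_batch]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    inputs = [
        [t + n for t, n in zip(target, est_noise[j])]
        for target, j in zip(targets, order)
    ]
    return inputs, targets


def teacher_student_inference(noisy: Sequence[float], teacher: Model, student: Model) -> Waveform:
    """Pass the noisy input through the teacher, then through the final student."""
    return student(teacher(noisy))


def toy_teacher(noisy: Sequence[float]) -> Waveform:
    """Hypothetical placeholder for a trained enhancement model: simple attenuation."""
    return [0.8 * v for v in noisy]
```

In this sketch, a student would be trained to map each remixed input back to its target, and at test time `teacher_student_inference` applies the teacher and the final student in succession.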