Diffusion-based generative models have recently gained attention in speech enhancement (SE), providing an alternative to conventional supervised methods. These models transform clean speech training samples into Gaussian noise centered at noisy speech, and then learn a parameterized model that reverses this process conditioned on noisy speech. Unlike supervised methods, generative SE approaches usually rely solely on an unsupervised loss, which may result in less efficient use of the noisy speech they are conditioned on. To address this issue, we propose augmenting the original diffusion training objective with a mean squared error (MSE) loss that measures the discrepancy between the estimated enhanced speech and the ground-truth clean speech at each reverse process iteration. Experimental results demonstrate the effectiveness of the proposed methodology.
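To make the proposed objective concrete, the following is a minimal sketch of one way a diffusion loss could be combined with an MSE term on the estimated clean speech. It is an illustration under stated assumptions and not the exact formulation used in this work: the linear mean interpolation from clean to noisy speech, the noise schedule `sigma_max * t`, the noise-prediction form of the diffusion loss, the weight `lam`, and all function names are hypothetical choices made for this example. Plain Python lists stand in for spectrogram tensors.

```python
import random


def perturb(clean, noisy, t, sigma_max=0.5, rng=random):
    """Sample x_t from an assumed forward process.

    Assumption: the mean moves linearly from clean (t=0) to noisy (t=1)
    and the standard deviation grows as sigma_max * t, so that at t=1
    the sample is Gaussian noise centered at the noisy speech.
    """
    sigma = sigma_max * t
    z = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * x + t * y + sigma * e
           for x, y, e in zip(clean, noisy, z)]
    return x_t, z


def estimate_clean(x_t, noisy, eps_hat, t, sigma_max=0.5):
    """Invert the assumed forward process using predicted noise (needs t < 1)."""
    sigma = sigma_max * t
    return [(xt - t * y - sigma * e) / (1.0 - t)
            for xt, y, e in zip(x_t, noisy, eps_hat)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def combined_loss(clean, noisy, x_t, eps_hat, z, t, lam=1.0, sigma_max=0.5):
    """Diffusion term plus lam times the MSE on the estimated clean speech."""
    # Unsupervised-style term: here a noise-prediction loss (an assumption).
    diffusion_term = mse(eps_hat, z)
    # Supervised term: compare the clean-speech estimate with the ground truth.
    x0_hat = estimate_clean(x_t, noisy, eps_hat, t, sigma_max)
    supervised_term = mse(x0_hat, clean)
    return diffusion_term + lam * supervised_term
```

In training, `eps_hat` would come from the parameterized model evaluated on `x_t`, the noisy speech, and `t`. In this sketch, a perfect noise prediction drives both terms to zero, while any prediction error is penalized both in the noise domain and in the clean-speech domain.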