Although deep neural network (DNN)-based speech enhancement (SE) methods outperform earlier non-DNN-based ones, they often degrade the perceptual quality of their outputs. To tackle this problem, we introduce Diffiner, a DNN-based generative refiner that aims to improve the perceptual quality of speech pre-processed by an SE method. We train a diffusion-based generative model on a dataset consisting of clean speech only. Our refiner then mixes clean parts, newly generated via denoising diffusion restoration, into the degraded and distorted parts caused by the preceding SE method, yielding refined speech. Once trained on a set of clean speech, the refiner can be applied to various SE methods without additional training specialized for each SE module. It can therefore serve as a versatile post-processing module for SE methods and has high potential in terms of modularity. Experimental results show that our method improves perceptual speech quality regardless of the preceding SE method used.