Diffusion models have recently shown promising results on difficult enhancement tasks such as the conditional and unconditional restoration of natural images and audio signals. In this work, we explore leveraging a recently proposed iterative diffusion model, cold diffusion, to recover clean speech signals from noisy signals. The mathematical properties of the cold diffusion sampling process could be used to restore high-quality samples from arbitrary degradations. Based on these properties, we propose an improved training algorithm and objective that help the model generalize better during sampling. We verify the proposed framework on two model architectures. Experimental results on the benchmark speech enhancement dataset VoiceBank-DEMAND demonstrate the strong performance of the proposed approach compared to representative discriminative models and diffusion-based enhancement models.
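To make the sampling property mentioned in the abstract concrete, the following is a minimal sketch of a cold-diffusion-style sampling loop on a toy signal. It is not the training algorithm or model proposed in this work. The update rule x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1) follows the sampler usually associated with cold diffusion; the linear interpolation used as the degradation D, the function names `degrade` and `cold_sample`, and the oracle restoration function are all assumptions introduced for illustration.

```python
# Toy sketch of a cold-diffusion-style sampler (illustrative assumptions only).

def degrade(clean, noisy, t, num_steps):
    """Assumed degradation D(x0, t): linear interpolation that equals
    `clean` at t = 0 and `noisy` at t = num_steps."""
    alpha = t / num_steps
    return [(1.0 - alpha) * c + alpha * n for c, n in zip(clean, noisy)]


def cold_sample(noisy, restore, num_steps):
    """Iteratively move from the noisy signal toward a clean estimate.

    `restore(x, t)` is a hypothetical restoration function that returns an
    estimate of the clean signal from the current state x at step t.
    """
    x = list(noisy)  # start from the fully degraded signal
    for t in range(num_steps, 0, -1):
        x0_hat = restore(x, t)
        d_t = degrade(x0_hat, noisy, t, num_steps)
        d_prev = degrade(x0_hat, noisy, t - 1, num_steps)
        # Remove the step-t degradation of the estimate and add back
        # the slightly weaker step-(t-1) degradation.
        x = [xi - a + b for xi, a, b in zip(x, d_t, d_prev)]
    return x


# Toy data: a short "clean" signal and a noisy observation of it.
clean = [0.0, 0.5, -0.25, 1.0]
noisy = [0.1, 0.3, -0.05, 1.2]

# Oracle restoration that always returns the true clean signal.
oracle = lambda x, t: clean

recovered = cold_sample(noisy, oracle, num_steps=10)
max_err = max(abs(r - c) for r, c in zip(recovered, clean))
```

With the oracle restoration function, each step lands exactly on the degradation of the clean signal at the previous step, so the loop ends at the clean signal up to floating-point error. In practice the restoration function would be a trained network whose estimates are imperfect, which is the situation that an improved training algorithm and objective are meant to address.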