Diffusion models are a new class of generative models that have shown outstanding performance in the image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach to adapting diffusion models to speech enhancement consists of modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously established in the image generation literature did not account for such a transformation towards the system input, which prevents relating existing diffusion-based speech enhancement systems to that framework. To address this, we extend the framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from the image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), and the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics while using fewer sampling steps, thus reducing the computational cost by a factor of four.
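The progressive transformation between the clean and noisy speech signals can be pictured with a minimal sketch. It is illustrative only and is not taken from this work: the exponential interpolation weight, the parameter `gamma`, the fixed noise level `sigma`, and the function names `interpolated_mean` and `sample_state` are all assumptions chosen for the example. The idea shown is that the mean of the diffusion state starts at the clean signal and drifts towards the noisy signal as the diffusion time grows, while Gaussian noise is added on top.

```python
import math
import random


def interpolated_mean(clean, noisy, t, gamma=1.5):
    """Mean of the diffusion state at time t in [0, 1].

    Assumed form (illustrative only): an exponential interpolation that
    equals the clean signal at t = 0 and drifts towards the noisy signal
    as t grows. `gamma` is a hypothetical stiffness parameter.
    """
    w = math.exp(-gamma * t)  # weight placed on the clean signal
    return [w * c + (1.0 - w) * n for c, n in zip(clean, noisy)]


def sample_state(clean, noisy, t, sigma, rng, gamma=1.5):
    """Draw a diffusion state: interpolated mean plus Gaussian noise of std `sigma`."""
    mean = interpolated_mean(clean, noisy, t, gamma)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


# Toy three-sample "signals" standing in for clean and noisy speech.
clean = [0.0, 1.0, -1.0]
noisy = [0.5, 0.5, -0.2]
rng = random.Random(0)

start = interpolated_mean(clean, noisy, t=0.0)  # identical to the clean signal
late = interpolated_mean(clean, noisy, t=1.0)   # between the two, closer to the noisy signal
state = sample_state(clean, noisy, t=1.0, sigma=0.1, rng=rng)
```

In a setup of this kind, the reverse process would start from a state centred near the noisy signal and move back towards the clean one. The noise schedule, the SDE and the sampler are the design choices the abstract refers to, and none of them is fixed by this sketch.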