Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. At test time, however, the neural network used for score estimation must be called multiple times to solve the iterative reverse process. This makes inference slow and introduces discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations with a two-stage training approach. In the first stage, we train the diffusion model in the usual way with the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that this second training stage achieves the same performance as the baseline model with only 5 function evaluations instead of 60. Whereas the performance of standard generative diffusion algorithms drops dramatically when the number of function evaluations (NFEs) is lowered to obtain single-step diffusion, our proposed method maintains steady performance. It therefore substantially outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.
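The interplay between the number of function evaluations, discretization error, and a predictive loss computed through the reverse solver can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not the model, SDE, or training code of the paper: the "clean" data is a scalar Gaussian, the noise schedule is variance exploding with sigma(t)^2 = t, the exact analytic score stands in for a stage-one model, the reverse process is a probability-flow ODE solved with Euler steps, the exact ODE endpoint stands in for the clean target, and a single hypothetical scale factor `theta` on the score is the only parameter tuned in the second stage (by finite differences rather than backpropagation).

```python
import numpy as np

# Toy assumptions (not from the paper): scalar Gaussian "clean" data N(M, S2),
# noise variance sigma(t)^2 = t, probability-flow ODE solved from t = T to 0.
M, S2, T = 0.0, 0.25, 1.0

def score(x, t, theta=1.0):
    """Analytic score of the marginal N(M, S2 + t), scaled by a hypothetical
    trainable factor theta (theta = 1.0 gives the exact score)."""
    return -theta * (x - M) / (S2 + t)

def reverse_solve(x_T, nfe, theta=1.0):
    """Euler solver for dx/dt = -0.5 * score(x, t); one score call per step."""
    x = np.asarray(x_T, dtype=float)
    h = T / nfe
    for k in range(nfe):
        t = T - k * h
        x = x + h * 0.5 * score(x, t, theta)  # one step backwards in time
    return x

def exact_solve(x_T):
    """Closed-form endpoint of the same ODE, used here as the toy target."""
    return M + (np.asarray(x_T, dtype=float) - M) * np.sqrt(S2 / (S2 + T))

def predictive_loss(theta, x_T, target, nfe):
    """Mean squared error between the solver output and the target."""
    return float(np.mean((reverse_solve(x_T, nfe, theta) - target) ** 2))

def fine_tune(x_T, target, nfe, steps=200, lr=0.5, eps=1e-4):
    """Toy second stage: gradient descent on the predictive loss, with the
    gradient through the solver approximated by central finite differences."""
    theta = 1.0  # start from the "stage-one" exact score
    for _ in range(steps):
        grad = (predictive_loss(theta + eps, x_T, target, nfe)
                - predictive_loss(theta - eps, x_T, target, nfe)) / (2 * eps)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
x_T = rng.normal(M, np.sqrt(S2 + T), size=1000)  # samples at t = T
target = exact_solve(x_T)

# Without fine-tuning, the error grows as the number of solver steps shrinks.
for nfe in (60, 5, 1):
    print(f"NFE={nfe:2d}  loss={predictive_loss(1.0, x_T, target, nfe):.6f}")

# Fine-tuning through the single-step solver absorbs its discretization error.
theta_1 = fine_tune(x_T, target, nfe=1)
print(f"NFE= 1 after fine-tuning  "
      f"loss={predictive_loss(theta_1, x_T, target, 1):.2e}")
```

In this linear toy a single scalar can cancel the one-step error exactly, which a real score network on speech would not be expected to do. The sketch is only meant to show the mechanism: the loss is measured on the solver output for a fixed step count, so the parameters adapt to that step count.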