Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This slows inference and introduces discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model in the usual way, using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that this second training stage achieves the same performance as the baseline model with only 5 function evaluations instead of 60. While the performance of standard generative diffusion algorithms drops dramatically when the number of function evaluations (NFEs) is lowered to obtain single-step diffusion, our proposed method maintains steady performance; it therefore substantially outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.
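To make the trade-off between NFEs and discretization error concrete, the following is a minimal sketch on a toy problem, not the reverse process used in the paper. It assumes a hypothetical scalar ODE, dx/dt = -Kx, whose drift stands in for the score network, and integrates it with a plain Euler scheme. Each Euler step costs one drift evaluation, so the step count equals the NFE. The names `euler_solve`, `toy_drift`, and `K` are illustrative only.

```python
import math


def euler_solve(drift, x0, t0, t1, n_steps):
    """Integrate dx/dt = drift(x, t) from t0 to t1 with n_steps Euler steps.

    Returns the final state and the number of drift evaluations (NFEs).
    """
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    nfe = 0
    for _ in range(n_steps):
        x = x + dt * drift(x, t)  # one drift call per step
        nfe += 1
        t += dt
    return x, nfe


# Hypothetical stand-in for a learned score/drift: a linear decay with rate K.
K = 3.0


def toy_drift(x, t):
    return -K * x


# The toy ODE has a closed-form solution at t = 1 when x(0) = 1.
exact = math.exp(-K)

# Compare a single-step, a 5-step, and a 60-step solve.
errors = {}
for n in (1, 5, 60):
    x_hat, nfe = euler_solve(toy_drift, 1.0, 0.0, 1.0, n)
    errors[nfe] = abs(x_hat - exact)
    print(f"NFE={nfe:2d}  estimate={x_hat:+.5f}  abs. error={errors[nfe]:.5f}")
```

In this toy setting the error shrinks as the NFE grows, and the single-step solve is far off. With a fixed drift, accuracy has to be bought with more steps. The second training stage described in the abstract instead trains the network on the output of the solver itself, which is how the few-step estimate can approach the clean target.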