Score-based generative models have recently been applied successfully to speech enhancement. A stochastic differential equation models the iterative forward process, in which environmental noise and white Gaussian noise are added to the clean speech signal at each step. In the limit, the mean of the forward process reaches the noisy mixture; in practice, the process stops earlier and therefore ends only at an approximation of the noisy mixture. This results in a discrepancy between the terminating distribution of the forward process and the prior used to solve the reverse process at inference. In this paper, we address this discrepancy. To this end, we propose a forward process based on a Brownian bridge and show that it reduces this mismatch compared to previous diffusion processes. More importantly, we show that our approach improves on the baseline process in objective metrics while using only half of the iteration steps and having one fewer hyperparameter to tune.
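The mismatch described above can be illustrated with a small numerical sketch. It is not the parameterization used in the paper, which the abstract does not specify. It assumes a textbook Brownian bridge with constant diffusion `sigma`, pinned at the clean signal `x0` at t = 0 and at the noisy mixture `y` at t = 1, and compares it with an Ornstein-Uhlenbeck-style mean that only approaches `y` exponentially. The function names, the stiffness `gamma`, and the toy vectors are all hypothetical.

```python
import math
import random


def ou_mean(x0, y, t, gamma=1.5):
    """Mean of an OU-style forward process drifting from x0 toward y.

    Assumed form: exp(-gamma*t)*x0 + (1 - exp(-gamma*t))*y, which reaches y
    only as t goes to infinity.
    """
    w = math.exp(-gamma * t)
    return [w * a + (1.0 - w) * b for a, b in zip(x0, y)]


def bridge_mean(x0, y, t):
    """Mean of a Brownian bridge pinned at x0 (t=0) and y (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, y)]


def bridge_std(t, sigma=0.5):
    """Standard deviation of the bridge at time t; zero at both endpoints."""
    return sigma * math.sqrt(t * (1.0 - t))


def sample_bridge(x0, y, t, sigma=0.5, rng=None):
    """Draw one sample from the bridge marginal at time t in [0, 1]."""
    rng = rng or random.Random(0)
    std = bridge_std(t, sigma)
    return [m + std * rng.gauss(0.0, 1.0) for m in bridge_mean(x0, y, t)]


def prior_mismatch(mean_at_end, y):
    """Euclidean distance between the terminal mean and the noisy mixture."""
    return math.sqrt(sum((m - b) ** 2 for m, b in zip(mean_at_end, y)))


# Toy "signals" (hypothetical values, three samples each).
x0 = [0.2, -0.1, 0.4]   # clean speech
y = [0.5, 0.3, -0.2]    # noisy mixture

print("OU-style mismatch at t=1:", prior_mismatch(ou_mean(x0, y, 1.0), y))
print("Bridge mismatch at t=1:  ", prior_mismatch(bridge_mean(x0, y, 1.0), y))
```

Under these assumptions the bridge mean coincides with `y` at the terminal time and its variance vanishes there, whereas the OU-style mean retains a residual offset of `exp(-gamma)` times the distance between `x0` and `y`. The sketch also has no stiffness parameter on the bridge side, which matches the abstract's remark about having one fewer hyperparameter.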