Score-based generative models have recently been applied successfully to speech enhancement. A stochastic differential equation models the iterative forward process, in which environmental noise and white Gaussian noise are added to the clean speech signal at each step. Although the mean of the forward process reaches the noisy mixture in the limit, in practice the process stops earlier and therefore ends only at an approximation of the noisy mixture. This results in a discrepancy between the terminating distribution of the forward process and the prior used for solving the reverse process at inference. In this paper, we address this discrepancy and propose a forward process based on a Brownian bridge. We show that such a process reduces the mismatch compared to previous diffusion processes. More importantly, we show that our approach improves on the baseline process in objective metrics while using only half as many iteration steps and having one fewer hyperparameter to tune.
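To make the mismatch concrete, the following is a minimal sketch rather than the formulation used in this paper. It assumes that the baseline forward mean can be illustrated by an exponential interpolation toward the noisy mixture (an Ornstein-Uhlenbeck-style mean, which reaches the mixture only as time goes to infinity) and compares it with the mean of a standard Brownian bridge, which is pinned to the mixture at the final time. The names `ou_mean`, `bridge_mean`, and `stiffness`, the value 1.5, and the random stand-in signals are all hypothetical; the noise schedules are left out because the abstract does not specify them.

```python
import numpy as np


def ou_mean(x0, y, t, stiffness):
    """Mean of a mean-reverting process started at x0 and pulled toward y.

    Assumed illustrative form: the weight on x0 decays as exp(-stiffness * t),
    so the mean equals y only in the limit t -> infinity.
    """
    w = np.exp(-stiffness * t)
    return w * x0 + (1.0 - w) * y


def bridge_mean(x0, y, t, T=1.0):
    """Mean of a Brownian bridge pinned at x0 (t = 0) and y (t = T)."""
    return x0 + (t / T) * (y - x0)


rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)           # stand-in for a clean speech signal
noise = 0.5 * rng.standard_normal(16000)  # stand-in for environmental noise
y = x0 + noise                            # noisy mixture

T = 1.0          # final time of the forward process
stiffness = 1.5  # hypothetical value, chosen only for illustration

# Distance between the terminal mean of each process and the noisy mixture.
ou_gap = np.linalg.norm(ou_mean(x0, y, T, stiffness) - y)
bridge_gap = np.linalg.norm(bridge_mean(x0, y, T, T) - y)

print(f"mean-reverting process, gap at t = T: {ou_gap:.4f}")
print(f"Brownian bridge, gap at t = T:        {bridge_gap:.2e}")
```

Under these assumptions, the mean-reverting process leaves a residual gap of exp(-stiffness * T) times the norm of the noise, while the bridge mean coincides with the mixture at the final time up to floating-point rounding. Removing the gap in the first case requires a larger stiffness or a longer horizon, whereas the bridge mean involves no such parameter in this sketch.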