This paper presents a new framework for diffusion-based speech enhancement. Our method employs a Schrödinger bridge to transform the noisy speech distribution into the clean speech distribution. To stabilize and improve training, we apply time-dependent scalings to the inputs and outputs of the network, known as preconditioning. We consider two skip-connection configurations, which either include or omit the current process state in the denoiser's output, enabling the network to predict either the environmental noise or the clean speech; each choice improves performance on different speech enhancement metrics. To keep activation magnitudes stable and balanced during training, we use a magnitude-preserving network architecture that normalizes all activations and network weights to unit length. Additionally, we propose learning the contribution of the noisy input within each network block for effective input conditioning. After training, we apply a method to approximate different exponential moving average (EMA) profiles and investigate their effect on speech enhancement performance. In contrast to image generation, where longer EMA lengths often improve mode coverage, we observe that shorter EMA lengths consistently yield better performance on standard speech enhancement metrics. Code, audio examples, and checkpoints are available online.
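To make the preconditioning idea concrete, below is a minimal sketch of EDM-style time-dependent input/output scalings with the two skip-connection configurations. The schedule functions `c_in`, `c_out`, `c_skip`, and `c_noise` are illustrative placeholders, not the paper's exact choices.

```python
def denoise(raw_net, x_t, y, t, c_in, c_out, c_skip, c_noise):
    """EDM-style preconditioning: wrap the raw network F with time-dependent
    scalings so its inputs and effective training targets stay well conditioned.

    x_t : current process state
    y   : noisy speech used as conditioning
    c_* : scalar schedules evaluated at t (placeholder signatures)
    """
    f = raw_net(c_in(t) * x_t, y, c_noise(t))
    # Configuration 1: c_skip(t) != 0 keeps x_t in the output, so the raw
    # network only has to predict the residual (e.g., environmental noise).
    # Configuration 2: c_skip(t) = 0 drops x_t, so c_out(t) * f becomes a
    # direct clean-speech estimate.
    return c_skip(t) * x_t + c_out(t) * f

# Purely illustrative schedules (EDM forms with unit data variance);
# swap c_skip for `lambda t: 0.0` to obtain the second configuration.
c_in    = lambda t: 1.0 / (1.0 + t**2) ** 0.5
c_skip  = lambda t: 1.0 / (1.0 + t**2)
c_out   = lambda t: t / (1.0 + t**2) ** 0.5
c_noise = lambda t: t
```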
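The magnitude-preserving design and the learned contribution of the noisy input could look roughly as follows. This sketch follows the magnitude-preserving layers popularized by Karras et al. (EDM2), which the description above matches; the exact block layout and the learnable mixing weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPLinear(nn.Module):
    """Linear layer whose weight rows are kept at unit norm, so unit-variance
    inputs produce (approximately) unit-variance outputs."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        if self.training:
            with torch.no_grad():
                # Forced weight normalization: re-project each stored row onto
                # the unit sphere every step to prevent magnitude drift.
                self.weight.copy_(F.normalize(self.weight, dim=1))
        # Normalize again on the differentiable path so gradients respect
        # the unit-norm constraint.
        w = F.normalize(self.weight, dim=1)
        return F.linear(x, w)

def mp_sum(a, b, t):
    """Magnitude-preserving sum: blend two (roughly) unit-variance,
    uncorrelated signals with weight t while keeping the result near unit
    variance. With t a learnable scalar per block (e.g., squashed through a
    sigmoid), the network learns how much of the noisy input to mix in --
    one plausible reading of the learned input conditioning, not the
    paper's confirmed implementation."""
    return ((1.0 - t) * a + t * b) / ((1.0 - t) ** 2 + t ** 2) ** 0.5
```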
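For the EMA study, the underlying mechanism is the standard exponential moving average of network weights, with the "EMA length" set by the decay; the post-hoc method of Karras et al. then approximates other profiles from snapshots saved during training. A minimal sketch of the tracking side only (the post-hoc least-squares reconstruction is omitted):

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, beta=0.999):
    """One EMA step: ema <- beta * ema + (1 - beta) * current weights.
    Larger beta corresponds to a longer EMA length; the abstract reports
    that shorter lengths score better on standard speech enhancement
    metrics."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - beta)

# Typical usage: keep a frozen copy and update it after each optimizer step.
# model = ...; ema_model = copy.deepcopy(model).eval()
# for batch in loader:
#     loss.backward(); opt.step(); opt.zero_grad()
#     ema_update(ema_model, model, beta=0.999)
```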