We study diffusion-based speech enhancement using a Schrödinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of the network inputs and outputs to stabilize training, and explore two skip-connection configurations that let the network predict either the environmental noise or the clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles post-training, finding that, unlike in image generation, a short or absent EMA consistently yields better speech-enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios and perceptual scores, with the two skip-connection variants exhibiting complementary strengths. These findings provide new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement.
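To make the EMA comparison concrete, the following is a minimal sketch (not taken from the paper) of EMA parameter smoothing with a configurable half-life, illustrating how a long EMA lags the raw weights while a short or absent EMA tracks them closely. The half-life parameterization and the toy training loop are assumptions for illustration only.

```python
import numpy as np

def ema_update(ema_params, params, halflife_steps):
    """One EMA step; the decay beta is derived from a half-life in training steps.
    halflife_steps <= 0 disables smoothing (the EMA tracks the raw weights)."""
    if halflife_steps <= 0:
        return params.copy()
    beta = 0.5 ** (1.0 / halflife_steps)  # weight halves every `halflife_steps` steps
    return beta * ema_params + (1.0 - beta) * params

# Toy "training": weights drift toward a target with noise (stand-in for SGD).
rng = np.random.default_rng(0)
target = np.ones(4)
w = np.zeros(4)
emas = {h: np.zeros(4) for h in (0, 100, 10000)}  # no / short / long EMA
for step in range(2000):
    w += 0.01 * (target - w) + 0.01 * rng.standard_normal(4)
    for h in emas:
        emas[h] = ema_update(emas[h], w, h)
```

After this loop, the no-EMA copy equals the raw weights, the short EMA sits near the converged weights, and the long EMA still lags far behind because most of its mass remains on early iterates.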