We study diffusion-based speech enhancement using a Schrödinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of network inputs and outputs to stabilize training, and explore two skip-connection configurations that let the network predict either the environmental noise or the clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles after training, finding that, unlike in image generation, a short or absent EMA consistently yields better speech-enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios and perceptual scores, with the two skip-connection variants exhibiting complementary strengths. These findings provide new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement.
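The post-hoc EMA analysis mentioned above can be illustrated with a simplified sketch. This is not the paper's method in full (EDM2 uses power-function EMA profiles; here classical exponential decays stand in, and all decay rates and the horizon length are illustrative): EMA snapshots tracked during training with known decay rates are combined by least squares so that their effective weighting over past training steps approximates a target EMA profile that was never tracked.

```python
import numpy as np

T = 1000  # illustrative number of training steps


def ema_weights(beta: float, T: int) -> np.ndarray:
    """Weight an EMA with decay `beta` places on the parameters at each
    training step, as seen at the end of training (most recent step last)."""
    steps = np.arange(T)
    w = (1.0 - beta) * beta ** (T - 1 - steps)
    return w / w.sum()


# Decays of the EMA snapshots assumed to be stored during training.
betas_stored = [0.9, 0.99, 0.999]
W = np.stack([ema_weights(b, T) for b in betas_stored], axis=1)  # (T, 3)

# Target EMA profile we did NOT track during training.
w_target = ema_weights(0.995, T)

# Least-squares coefficients: the post-hoc approximation of the target EMA
# parameters would then be sum_i coef[i] * snapshot_i.
coef, *_ = np.linalg.lstsq(W, w_target, rcond=None)

# Residual of the reconstructed weight profile vs. the target profile.
err = np.linalg.norm(W @ coef - w_target)
```

Sweeping the target decay in this way is what allows EMA length to be treated as a post-training hyperparameter, rather than fixed before training starts.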