In this paper, we present a novel diffusion-model-based method for monaural speech enhancement. Our approach estimates the magnitude and phase of the speech spectrum separately, using two diffusion networks. Throughout the diffusion process, clips of real-world noise are gradually added to the clean speech spectra, and we propose a noise-aware reverse process that learns to generate both the clean speech spectra and the noise spectra. Furthermore, to fully leverage the intrinsic relationship between magnitude and phase, we introduce a complex-cycle-consistent (CCC) mechanism that uses the estimated magnitude to map the phase, and vice versa. We implement this algorithm within a phase-aware speech enhancement diffusion model (SEDM). Extensive experiments on public datasets demonstrate the effectiveness of our method and highlight the significant benefits of exploiting the intrinsic relationship between phase and magnitude information for speech enhancement. In comparisons with conventional diffusion models, SEDM achieves superior performance.
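The two ingredients described in the abstract, a separate magnitude and phase representation and the gradual addition of real noise to the clean spectrum, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear mixing schedule, and the random stand-in spectra are all assumptions made for illustration, and the abstract does not specify the actual noise schedule or network design.

```python
import numpy as np

def split_mag_phase(spec):
    """Split a complex spectrum into magnitude and phase (radians)."""
    return np.abs(spec), np.angle(spec)

def combine_mag_phase(mag, phase):
    """Rebuild the complex spectrum from magnitude and phase."""
    return mag * np.exp(1j * phase)

def forward_noising(clean_spec, noise_spec, t, num_steps):
    """Mix a real-noise spectrum into the clean spectrum at step t.

    ASSUMPTION: a linear schedule where the noise weight grows from
    0 (t = 0) to 1 (t = num_steps). The paper's schedule may differ.
    """
    weight = t / num_steps
    return clean_spec + weight * noise_spec

# Hypothetical stand-ins for STFT frames (frequency bins x time frames).
rng = np.random.default_rng(0)
shape = (257, 10)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Halfway through the assumed schedule, split the noisy spectrum into
# the two quantities that the two diffusion networks would estimate.
noisy_mid = forward_noising(clean, noise, t=25, num_steps=50)
mag, phase = split_mag_phase(noisy_mid)
rebuilt = combine_mag_phase(mag, phase)
```

In this sketch, `mag` and `phase` stand for the two targets handled by the separate networks, and `rebuilt` shows that the pair fully determines the complex spectrum, which is the relationship the CCC mechanism is described as exploiting.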