Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Phase information has a significant impact on the perceptual quality and intelligibility of speech. However, existing speech enhancement methods are limited in explicit phase estimation because of the non-structural nature and wrapping characteristics of the phase, which creates a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. MP-SENet adopts a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are then fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra through a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase through explicit phase estimation, improving the perceptual quality of the enhanced speech.
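To make the wrapping issue and the parallel decoder outputs more concrete, the following NumPy sketch shows one plausible way to obtain a wrapped phase from two parallel real-valued outputs, to measure phase error without being misled by the 2π discontinuity, and to recombine a masked magnitude with an estimated phase. The abstract does not specify these details, so every function name here (`wrap_phase`, `anti_wrapped_error`, `phase_from_parallel_outputs`, `apply_mask_and_phase`), the use of a two-argument arctangent, and the multiplicative form of the mask are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

TWO_PI = 2.0 * np.pi


def wrap_phase(phi):
    """Map a phase value (or array) to its principal value, |result| <= pi."""
    phi = np.asarray(phi, dtype=float)
    return phi - TWO_PI * np.round(phi / TWO_PI)


def anti_wrapped_error(pred_phase, target_phase):
    """Absolute phase error that ignores differences of whole 2*pi turns.

    Assumption: a simple stand-in for a loss defined on wrapped phase.
    """
    return np.abs(wrap_phase(np.asarray(pred_phase) - np.asarray(target_phase)))


def phase_from_parallel_outputs(pseudo_real, pseudo_imag):
    """Combine two parallel real-valued outputs into a wrapped phase.

    Assumption: the two outputs act like the real and imaginary parts of a
    complex number, so arctan2 returns an angle in [-pi, pi].
    """
    return np.arctan2(pseudo_imag, pseudo_real)


def apply_mask_and_phase(noisy_mag, mask, phase):
    """Rebuild a complex spectrum from a masked magnitude and a phase.

    Assumption: the mask multiplies the noisy magnitude element-wise.
    """
    return (np.asarray(noisy_mag) * np.asarray(mask)) * np.exp(1j * np.asarray(phase))
```

For two phases that sit just on either side of the wrap point, such as `np.pi - 0.05` and `-np.pi + 0.05`, the plain absolute difference is close to 2π, whereas `anti_wrapped_error` reports roughly 0.1, which reflects how close the two angles actually are.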
