Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Phase information has a significant impact on the perceptual quality and intelligibility of speech. However, existing speech enhancement methods are limited in explicit phase estimation because the phase is non-structural and wrapped, which creates a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by time-frequency Transformers operating along both the time and frequency dimensions. The encoder encodes time-frequency representations derived from the distorted input magnitude and phase spectra. The decoder comprises dual-stream magnitude and phase decoders, which directly enhance the magnitude and wrapped phase spectra by incorporating a magnitude estimation architecture and a parallel phase estimation architecture, respectively. To train MP-SENet effectively, we define multi-level loss functions: a mean square error loss and a perceptual metric loss on the magnitude spectra, an anti-wrapping loss on the phase spectra, and a mean square error loss and a consistency loss on the short-time complex spectra. Experimental results demonstrate that MP-SENet achieves high-quality speech enhancement across multiple tasks, including speech denoising, dereverberation, and bandwidth extension. Compared with existing phase-aware speech enhancement methods, it avoids the bidirectional compensation effect between magnitude and phase, leading to better harmonic restoration. Notably, for the speech denoising task, MP-SENet achieves state-of-the-art performance, with a PESQ of 3.60 on the public VoiceBank+DEMAND dataset.
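The wrapping issue and the anti-wrapping loss mentioned above can be illustrated with a short NumPy sketch. The abstract gives no formulas, so everything here is an assumption: the anti-wrapping function is taken to map a phase error to its principal value via x - 2*pi*round(x / (2*pi)), the phase decoder is taken to produce two real-valued maps that are combined with a two-argument arctangent, and the names anti_wrap, wrapped_phase, and to_complex are hypothetical.

```python
import numpy as np


def anti_wrap(x):
    """Assumed anti-wrapping function: magnitude of the principal value
    of a phase difference, always in [0, pi]."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))


def wrapped_phase(pseudo_real, pseudo_imag):
    """Assumed parallel phase estimation: combine two real-valued outputs
    into a wrapped phase in [-pi, pi] with the two-argument arctangent."""
    return np.arctan2(pseudo_imag, pseudo_real)


def to_complex(magnitude, phase):
    """Rebuild a short-time complex spectrum from magnitude and phase."""
    return magnitude * np.exp(1j * phase)


# Phases just below +pi and just above -pi are almost the same angle,
# but a plain absolute difference treats them as nearly 2*pi apart.
true_phase = np.array([3.1, -3.1, 0.5])
est_phase = np.array([-3.1, 3.1, 0.4])

naive_error = np.abs(est_phase - true_phase)    # large for the first two bins
aw_error = anti_wrap(est_phase - true_phase)    # small for all three bins
```

In this toy case the plain difference penalizes the first two bins heavily even though the estimated angles nearly coincide with the targets, while the anti-wrapped error stays small. That is one plausible reading of why a loss defined on wrapped phase needs such a mapping.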
