This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from an input amplitude spectrum using two-stage neural networks. In the initial prior-construction stage, we predict a coarse prior phase spectrum from the amplitude spectrum. The subsequent refinement stage maps the amplitude spectrum to a refined, high-quality phase spectrum conditioned on the prior phase. The networks in both stages use ConvNeXt v2 blocks as the backbone and are trained adversarially with a newly introduced phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, we also incorporate a time-frequency integrated difference (TFID) loss in the refinement stage. Experimental results confirm that, compared with neural-network-based phase prediction methods that use no prior, the proposed SP-NSPP achieves higher phase prediction accuracy, owing to the coarse phase prior and the diverse training criteria. Compared with iterative phase estimation algorithms, SP-NSPP does not require multiple rounds of iteration and therefore generates phase more efficiently.
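The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the SP-NSPP implementation: `make_stand_in_net` is a hypothetical placeholder (a fixed random linear map) standing in for the trained ConvNeXt v2 networks, the use of the log-amplitude as input is an assumption, and conditioning by channel-wise concatenation is also an assumption, since the abstract does not say how the prior is injected. The sketch only shows that phase is obtained in a single forward pass through two stages, with no iterative loop, and then combined with the amplitude into a complex spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)


def wrap(phase):
    """Wrap angles in radians to the principal interval [-pi, pi)."""
    return (phase + np.pi) % (2.0 * np.pi) - np.pi


def make_stand_in_net(in_dim, out_dim):
    """Hypothetical placeholder for a trained network.

    Returns a function that applies a fixed random linear map to each
    frame and wraps the result to a valid phase range.
    """
    weights = rng.standard_normal((out_dim, in_dim)) * 0.1

    def net(features):
        # features has shape (in_dim, frames)
        return wrap(weights @ features)

    return net


def predict_phase(log_amp, prior_net, refine_net):
    """Run both stages once; no iterative refinement loop is involved."""
    # Stage 1 (prior construction): coarse phase from the amplitude alone.
    prior = prior_net(log_amp)
    # Stage 2 (refinement): amplitude conditioned on the prior phase.
    # Assumption: conditioning is done by stacking along the feature axis.
    conditioned = np.concatenate([log_amp, prior], axis=0)
    refined = refine_net(conditioned)
    return prior, refined


freq_bins, frames = 5, 8
amp = np.abs(rng.standard_normal((freq_bins, frames))) + 1e-3
log_amp = np.log(amp)  # assumption: networks take the log-amplitude

prior_net = make_stand_in_net(freq_bins, freq_bins)
refine_net = make_stand_in_net(2 * freq_bins, freq_bins)

prior, refined = predict_phase(log_amp, prior_net, refine_net)

# Combine the given amplitude with the predicted phase into a complex
# spectrum, which an inverse STFT could turn back into a waveform.
spectrum = amp * np.exp(1j * refined)
```

With trained networks in place of the placeholders, `spectrum` would be passed to an inverse short-time Fourier transform; the amplitude is left untouched, and only the phase is predicted.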