This paper presents a novel speech phase prediction model that uses neural networks to predict wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture consists of two parallel linear convolutional layers and a phase calculation formula. It imitates the process of calculating phase spectra from the real and imaginary parts of complex spectra, and it strictly restricts the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses between the predicted wrapped phase spectra and the natural ones. These losses are obtained by activating the instantaneous phase error, group delay error, and instantaneous angular frequency error with an anti-wrapping function. Experimental results show that the proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in terms of both reconstructed speech quality and generation speed.
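The phase calculation step and the anti-wrapping losses described above can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption rather than the paper's stated formulation: the abstract does not give the phase calculation formula or the anti-wrapping function, so the sketch uses the two-argument arctangent for the former and the distance to the nearest multiple of 2π for the latter. It also assumes that group delay is the phase difference along the frequency axis and that instantaneous angular frequency is the phase difference along the time axis. The function names and the `(frequency_bins, frames)` layout are hypothetical.

```python
import numpy as np

TWO_PI = 2.0 * np.pi


def phase_from_pseudo_parts(real, imag):
    """Combine two parallel layer outputs into a wrapped phase.

    Assumption: the two-argument arctangent stands in for the phase
    calculation formula; its result always lies in [-pi, pi].
    """
    return np.arctan2(imag, real)


def anti_wrap(x):
    """Distance from x to the nearest multiple of 2*pi (assumed form)."""
    return np.abs(x - TWO_PI * np.round(x / TWO_PI))


def anti_wrapping_losses(pred, target):
    """Return three mean errors for phases shaped (frequency_bins, frames).

    Assumptions: group delay is the difference along the frequency axis
    (axis 0) and instantaneous angular frequency is the difference along
    the time axis (axis 1).
    """
    ip_loss = anti_wrap(pred - target).mean()
    gd_loss = anti_wrap(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    iaf_loss = anti_wrap(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    return ip_loss, gd_loss, iaf_loss
```

Under these assumptions, a predicted phase of -π + 0.05 against a natural phase of π - 0.05 is scored as an error of roughly 0.1 rad rather than roughly 2π - 0.1 rad, which is the kind of error expansion the anti-wrapping losses are meant to avoid.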