Expressive human speech abounds with rich and flexible prosodic variation. The prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, learned by directly minimizing a norm of the prosody prediction error. The unimodal nature of such predictions leads to a mismatch with the ground-truth prosody distribution and harms the model's ability to make diverse predictions. We therefore propose a novel prosody predictor based on the denoising diffusion probabilistic model (DDPM), taking advantage of its high-quality generative modeling and training stability. Experimental results confirm that the proposed prosody predictor outperforms the deterministic baseline in both the expressiveness and the diversity of its predictions, while using fewer network parameters.
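The contrast drawn in the abstract can be illustrated with a small numerical sketch. It is not the authors' predictor: the toy bimodal "pitch offset" targets, the linear noise schedule (1000 steps, betas from 1e-4 to 0.02), and all function names are assumptions chosen for illustration. The first part shows that the squared-error optimum for bimodal targets sits between the modes; the second part shows the standard DDPM forward noising step and the simplified noise-prediction loss that a diffusion-based predictor would typically be trained with.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "prosody" targets (hypothetical): for the same input, the pitch offset
# is either low or high, so the ground-truth distribution is bimodal.
modes = rng.choice([-1.0, 1.0], size=10_000)
targets = modes + 0.1 * rng.standard_normal(10_000)

# A deterministic predictor trained with squared error converges to the mean
# of the targets, which here falls between the two modes.
l2_optimal = targets.mean()


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion: draw x_t ~ q(x_t | x_0) and return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def noise_prediction_loss(eps_pred, eps):
    """Simplified DDPM objective: mean squared error on the injected noise."""
    return float(np.mean((eps_pred - eps) ** 2))


alpha_bar = make_alpha_bar()
x_t, eps = q_sample(targets, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a real system the noise estimate `eps_pred` would come from a neural network conditioned on the text and the step `t`; sampling would then run the learned reverse process from pure noise, so repeated draws can land on different modes instead of collapsing to their average.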