Generating natural speech with diverse and smooth prosody patterns is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated as a way to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what humans can achieve. This is largely due to the use of unimodal distributions, such as a single Gaussian, in prior work on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network (GMM-MDN). Experiments on the LJSpeech dataset demonstrate that phone-level prosodies can precisely control the synthetic speech and that the GMM-MDN generates more natural and smoother prosody patterns than a single Gaussian. Subjective evaluations further show that the proposed approach not only achieves better naturalness, but also significantly improves the prosody diversity of synthetic speech without the need for manual control.
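To illustrate why a mixture can yield more diverse samples than a single Gaussian, the following is a minimal sketch of evaluating and sampling a diagonal-covariance GMM for one phone. It is not the paper's implementation: in a GMM-MDN the mixture parameters would be predicted by a network for each phone, whereas here they are hard-coded, and the names `gmm_log_likelihood`, `sample_gmm`, `weights`, `means`, and `log_stds` are hypothetical.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, log_stds):
    """Log-density of x (shape D) under a K-component diagonal GMM.

    weights: (K,) mixture weights summing to 1
    means, log_stds: (K, D) per-component parameters
    In an MDN these would be network outputs (assumption for this sketch).
    """
    z = (x - means) / np.exp(log_stds)
    # Per-component Gaussian log-density, summed over dimensions: shape (K,)
    log_comp = -0.5 * np.sum(z ** 2 + 2.0 * log_stds + np.log(2.0 * np.pi), axis=1)
    a = np.log(weights) + log_comp
    # Log-sum-exp over components for numerical stability
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))


def sample_gmm(weights, means, log_stds, rng):
    """Draw one prosody vector: pick a component, then sample from its Gaussian."""
    k = rng.choice(len(weights), p=weights)
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.exp(log_stds[k]) * noise


# Toy parameters (made up for illustration): two well-separated modes in 2-D.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-5.0, -5.0], [5.0, 5.0]])
log_stds = np.full((2, 2), np.log(0.1))
samples = np.stack([sample_gmm(weights, means, log_stds, rng) for _ in range(200)])
```

With these toy parameters the samples cluster around two distinct modes, which a single Gaussian fitted to the same data could only cover by placing most of its mass between them. Training an MDN would minimise the negative of `gmm_log_likelihood` over the observed phone-level prosody vectors.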