Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating long-form speech at human parity, with highly dynamic prosodic variation, remains challenging. To address this problem, we extend GNNs with a hierarchical prosody modeling approach named HiGNN-TTS. Specifically, we add a virtual global node to the graph to strengthen the interconnection of word nodes, and we introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we apply hierarchical supervision from acoustic prosody to each node of the graph to capture prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
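
As a rough illustration of the virtual global node idea, the sketch below builds a small graph over the word nodes of one sentence and then appends a single extra node connected to every word. The chain-shaped word graph, the function names `build_chain_graph` and `add_virtual_global_node`, and the bidirectional edges are all assumptions made for illustration; the abstract does not specify how HiGNN-TTS constructs its graph or its edges.

```python
from typing import Dict, List, Set


def build_chain_graph(words: List[str]) -> Dict[int, Set[int]]:
    """Link each word node to its immediate neighbours.

    Assumption: a simple chain stands in for the real word graph,
    whose construction is not described in the abstract.
    """
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(words))}
    for i in range(len(words) - 1):
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    return graph


def add_virtual_global_node(graph: Dict[int, Set[int]]) -> int:
    """Add one node linked in both directions to every existing node.

    Returns the index of the new (hypothetical) global node.
    """
    global_idx = len(graph)
    word_nodes = list(graph.keys())
    graph[global_idx] = set(word_nodes)
    for node in word_nodes:
        graph[node].add(global_idx)
    return global_idx


words = ["the", "quick", "brown", "fox"]
graph = build_chain_graph(words)
global_idx = add_virtual_global_node(graph)
```

In this toy graph, any two word nodes are at most two hops apart once the global node is present, which is one plausible reading of how such a node strengthens the interconnection of word nodes.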