Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with highly dynamic prosodic variation remains challenging. To address this problem, we extend GNNs with a hierarchical prosody modeling approach named HiGNN-TTS. Specifically, we add a virtual global node to the graph to strengthen the interconnection of word nodes, and we introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we apply hierarchical supervision from acoustic prosody to each node of the graph to capture prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
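The virtual global node can be illustrated with a minimal sketch. The abstract does not specify how the word graph is built, so the chain-shaped 4-word graph, the adjacency-matrix representation, and the helper name `add_virtual_global_node` below are all assumptions made for illustration, not the HiGNN-TTS implementation.

```python
from typing import List


def add_virtual_global_node(adjacency: List[List[int]]) -> List[List[int]]:
    """Return a new adjacency matrix with one extra node linked to every word node.

    Hypothetical helper: the original matrix is left unmodified.
    """
    n = len(adjacency)
    # Copy each existing row and append an edge to the new global node.
    augmented = [list(row) + [1] for row in adjacency]
    # The global node's own row: connected to every word node, no self-loop.
    augmented.append([1] * n + [0])
    return augmented


# Assumed example: a 4-word sentence where only adjacent words share an edge.
word_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
augmented_graph = add_virtual_global_node(word_graph)
```

In this sketch the first and last words are three hops apart in the original graph, but after augmentation any two word nodes are at most two hops apart through the global node, which is one way to read the "strengthened interconnection" described above.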