Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
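The hierarchical emotion distribution (ED) described above can be pictured as segment-level averages of frame-wise emotion probabilities. The sketch below is a minimal, hypothetical illustration: it assumes per-frame emotion probabilities from some pretrained speech emotion recognizer and known phoneme/word frame alignments, neither of which is specified in the abstract, and all function names are invented for illustration.

```python
# Hypothetical sketch of a hierarchical emotion distribution (ED) embedding.
# Assumes per-frame emotion probabilities (e.g. from a pretrained SER model)
# and (start, end) frame spans for phonemes and words; all illustrative.

def average(frames):
    """Element-wise mean of a list of equal-length probability vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def hierarchical_ed(frame_probs, phoneme_spans, word_spans):
    """Build phoneme-, word-, and utterance-level emotion distributions.

    frame_probs:   list of per-frame emotion probability vectors
    phoneme_spans: (start, end) frame indices for each phoneme
    word_spans:    (start, end) frame indices for each word
    Returns a dict mapping each level to its ED vector(s).
    """
    return {
        "phoneme": [average(frame_probs[s:e]) for s, e in phoneme_spans],
        "word": [average(frame_probs[s:e]) for s, e in word_spans],
        "utterance": average(frame_probs),
    }

# Toy example: 4 frames, 2 emotion classes (e.g. neutral vs. happy),
# two phonemes of 2 frames each forming a single word/utterance.
probs = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
ed = hierarchical_ed(probs, phoneme_spans=[(0, 2), (2, 4)], word_spans=[(0, 4)])
```

Concatenating (or conditioning on) these per-level vectors gives the TTS model a quantifiable handle on emotion intensity at each granularity, which is what enables the fine-grained control the abstract claims.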