Effectively controlling emotion rendering in text-to-speech (TTS) synthesis remains a challenge. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at multiple levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the emotion predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
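To make the hierarchical ED concrete, the following is a minimal sketch of how per-phoneme emotion intensities might be pooled into word- and utterance-level distributions. The function name, the mean-pooling choice, and the phoneme-to-word alignment format are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hierarchical_emotion_distribution(phoneme_scores, word_boundaries):
    """Sketch of a three-level emotion distribution: phoneme, word, utterance.

    phoneme_scores: (num_phonemes, num_emotions) array of per-phoneme emotion
        intensities in [0, 1] (assumed to come from a pre-trained scorer).
    word_boundaries: list of (start, end) phoneme-index pairs, one per word.
    """
    phoneme_scores = np.asarray(phoneme_scores, dtype=float)

    # Word-level ED: pool the phoneme intensities inside each word span
    # (mean pooling is an assumption; other aggregations are possible).
    word_ed = np.stack([
        phoneme_scores[start:end].mean(axis=0)
        for start, end in word_boundaries
    ])

    # Utterance-level ED: pool over all phonemes in the utterance.
    utterance_ed = phoneme_scores.mean(axis=0)

    return {
        "phoneme": phoneme_scores,   # (num_phonemes, num_emotions)
        "word": word_ed,             # (num_words, num_emotions)
        "utterance": utterance_ed,   # (num_emotions,)
    }

# Example: 5 phonemes, 3 emotion classes, two words covering phonemes 0-2 and 3-4.
scores = np.random.rand(5, 3)
ed = hierarchical_emotion_distribution(scores, [(0, 3), (3, 5)])
print(ed["word"].shape, ed["utterance"].shape)
```

At run-time, such a structure could be edited at any level (e.g., raising a single word's intensity) before conditioning the TTS model, which is the kind of quantitative, constituent-level control the abstract describes.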