Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to a few predefined labels, limiting how well the nuanced variations of emotion can be reflected. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network that improves the quality of the generated speech by reflecting its multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity while producing high-quality expressive speech.
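Since the abstract only names the Cartesian-spherical transformation, a minimal NumPy sketch may help make it concrete. It assumes the arousal-valence-dominance (AVD) pseudo-labels are first centered on a neutral reference point, so that the radius can act as emotion intensity and the angles as emotion style; the function name `cartesian_to_spherical` and the centering step are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np

def cartesian_to_spherical(avd, neutral_center):
    """Map AVD pseudo-labels to a spherical emotion vector (r, theta, phi).

    avd, neutral_center: arrays of shape (..., 3). Centering on a neutral
    reference point is an assumption of this sketch.
    """
    x, y, z = np.moveaxis(avd - neutral_center, -1, 0)
    r = np.sqrt(x**2 + y**2 + z**2)  # radius: a proxy for emotion intensity
    # Guard against division by zero and arccos domain errors.
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))  # polar angle
    phi = np.arctan2(y, x)  # azimuthal angle; (theta, phi) encode emotion style
    return np.stack([r, theta, phi], axis=-1)

# Example: one pseudo-labeled utterance relative to a hypothetical neutral center.
avd = np.array([0.7, 0.2, 0.5])
neutral = np.array([0.5, 0.5, 0.5])
print(cartesian_to_spherical(avd, neutral))
```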
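The dual conditional adversarial network is likewise only named, not specified. The PyTorch sketch below shows one plausible reading: a discriminator whose real/fake judgment is conditioned on two aspects of the utterance, here taken to be a speaker embedding and the spherical emotion vector. The architecture, layer sizes, and the choice of the two conditions are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualConditionalDiscriminator(nn.Module):
    """Sketch of a discriminator conditioned on two aspects of the speech.

    Assumed conditions: a speaker embedding and the spherical emotion
    vector (r, theta, phi); all dimensions are hypothetical.
    """
    def __init__(self, mel_dim=80, spk_dim=64, emo_dim=3, hidden=256):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Conv1d(mel_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        # Each condition is projected and fused with the frame features,
        # so the adversarial loss sees both aspects of the utterance.
        self.spk_proj = nn.Linear(spk_dim, hidden)
        self.emo_proj = nn.Linear(emo_dim, hidden)
        self.out = nn.Conv1d(hidden, 1, kernel_size=3, padding=1)

    def forward(self, mel, spk_emb, emo_vec):
        # mel: (B, mel_dim, T); spk_emb: (B, spk_dim); emo_vec: (B, emo_dim)
        h = self.frame_net(mel)
        cond = (self.spk_proj(spk_emb) + self.emo_proj(emo_vec)).unsqueeze(-1)
        return self.out(h + cond)  # per-frame real/fake logits, shape (B, 1, T)
```

Conditioning the discriminator on both aspects, rather than on a single label, is one way to push the generator to match emotion-dependent and speaker-dependent characteristics at the same time, which matches the abstract's stated goal of reflecting multi-aspect characteristics.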