Quantitatively controlling the expressiveness of emotion in speech generation remains a significant challenge. In this work, we present a novel approach for manipulating how emotions are rendered in generated speech. We propose a hierarchical emotion distribution extractor, Hierarchical ED, that quantifies emotion intensity at different levels of granularity. Support vector machines (SVMs) are employed to rank emotion intensity, yielding a hierarchical emotion embedding. Hierarchical ED is then integrated into the FastSpeech2 framework, guiding the model to learn emotion intensity at the phoneme, word, and utterance levels. During synthesis, users can manually edit the emotion intensity of the generated voices. Both objective and subjective evaluations demonstrate the effectiveness of the proposed network for fine-grained quantitative emotion editing.
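As a rough illustration of the idea of quantifying intensity at three levels and then editing it, the following minimal sketch may help. It is not the implementation behind Hierarchical ED. It assumes a linear ranking function whose weights have already been learned (for example by a ranking SVM), per-phoneme feature vectors, mean pooling to obtain word-level and utterance-level features, and fixed score bounds for scaling to [0, 1]. All function names, the pooling choice, and the bounds are hypothetical.

```python
from statistics import fmean


def rank_score(weights, features):
    """Linear ranking score w . x; `weights` is assumed to be already trained."""
    return sum(w * x for w, x in zip(weights, features))


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors, per dimension."""
    return [fmean(column) for column in zip(*vectors)]


def to_unit_interval(score, low, high):
    """Min-max scale a raw score into [0, 1] using assumed score bounds."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (score - low) / (high - low)))


def hierarchical_intensity(words, weights, low, high):
    """Compute intensities at phoneme, word, and utterance level.

    `words` is a list of words; each word is a list of phoneme feature vectors.
    Word and utterance features are mean-pooled phoneme features (an assumption).
    """
    phonemes = [phoneme for word in words for phoneme in word]

    def intensity(vector):
        return to_unit_interval(rank_score(weights, vector), low, high)

    return {
        "phoneme": [intensity(p) for p in phonemes],
        "word": [intensity(mean_pool(word)) for word in words],
        "utterance": intensity(mean_pool(phonemes)),
    }


def edit_intensity(ed, level, value, index=None):
    """Return a copy of `ed` with one intensity replaced by a user value clipped to [0, 1]."""
    clipped = min(1.0, max(0.0, value))
    edited = {
        "phoneme": list(ed["phoneme"]),
        "word": list(ed["word"]),
        "utterance": ed["utterance"],
    }
    if level == "utterance":
        edited["utterance"] = clipped
    else:
        edited[level][index] = clipped
    return edited
```

In a full system, the edited values would be passed to the acoustic model as conditioning at synthesis time; that step is outside the scope of this sketch.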