Current emotional text-to-speech (TTS) systems struggle to mimic the broad spectrum of human emotions because of the inherent complexity of emotion and the limitations of emotional speech datasets and models. This paper proposes a TTS framework that enables control over pleasure, arousal, and dominance and can synthesize a diverse range of emotional styles without requiring any emotional speech data during TTS training. We train an emotional attribute predictor using only categorical labels from speech data, in line with psychological research, and apply anchored dimensionality reduction to self-supervised learning (SSL) features. The TTS framework converts text inputs into phonetic tokens via an autoregressive language model and uses pseudo-emotional dimensions to guide the parallel prediction of fine-grained acoustic details. Experiments on the LibriTTS dataset demonstrate that our framework synthesizes speech with enhanced naturalness and a variety of emotional styles by effectively controlling emotional dimensions, even though no emotional speech is included during TTS training.
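The anchored dimensionality reduction step can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: it assumes class-mean anchors per categorical emotion and hand-picked opposing emotion pairs to define the pleasure and arousal axes; the feature dimensionality, emotion labels, and pairings are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder SSL features (e.g., 768-dim) for utterances with categorical
# emotion labels; in practice these would come from a pretrained SSL model.
feats = {lab: rng.normal(size=(50, 768))
         for lab in ["happy", "sad", "angry", "calm"]}

# Anchor = mean SSL feature of each emotion category.
anchors = {lab: x.mean(axis=0) for lab, x in feats.items()}

def axis(pos, neg):
    """Unit direction between two opposing anchors (pairing is an assumption)."""
    d = anchors[pos] - anchors[neg]
    return d / np.linalg.norm(d)

pleasure_axis = axis("happy", "sad")
arousal_axis = axis("angry", "calm")

def pseudo_dims(x):
    """Project an SSL feature onto the anchored axes -> pseudo emotion dims."""
    return np.array([x @ pleasure_axis, x @ arousal_axis])

dims = pseudo_dims(feats["happy"][0])
```

The resulting continuous `dims` values stand in for the pseudo-emotional dimensions that condition the TTS model, so that only categorical labels, not dimensional annotations, are needed at training time.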