Emotional text-to-speech (TTS) systems struggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expression and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions: pleasure, arousal, and dominance (PAD). To support this, we train an emotional dimension predictor that maps the categorical emotion labels found in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explicit emotion labels during training. Objective and subjective evaluations demonstrate that our framework generates more expressive emotional styles and improves both naturalness and diversity compared to baselines.
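The label-to-PAD mapping described above can be illustrated with a minimal sketch. The coordinates below are hypothetical placeholders in the style of PAD values from the psychology literature, not the paper's learned predictor, which is a trained model rather than a fixed lookup table:

```python
# Illustrative sketch: mapping categorical emotion labels to continuous
# pleasure-arousal-dominance (PAD) coordinates. The values here are
# made-up placeholders for demonstration only; the paper trains a
# predictor to produce such coordinates from speech data.

PAD_LOOKUP = {
    # label: (pleasure, arousal, dominance), each in [-1, 1]
    "happy":   (0.8, 0.5, 0.4),
    "angry":   (-0.6, 0.7, 0.3),
    "sad":     (-0.6, -0.4, -0.3),
    "neutral": (0.0, 0.0, 0.0),
}

def label_to_pad(label: str) -> tuple[float, float, float]:
    """Return a PAD coordinate for a categorical emotion label."""
    return PAD_LOOKUP[label.lower()]
```

Because the TTS framework conditions on the continuous PAD vector rather than the label itself, users can also supply arbitrary points in the space (e.g. interpolating between "happy" and "neutral") at inference time.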