Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
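The abstract describes reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. A minimal sketch of how such rewards might be combined into a single scalar is shown below; the weights, score ranges, and function names are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: combining the three task-specific rewards named in the
# abstract (emotion category, intensity, emphasis) into one scalar reward for
# RL fine-tuning. Weights and scoring conventions are assumptions for
# illustration only.

def composite_reward(category_score: float,
                     intensity_score: float,
                     emphasis_score: float,
                     weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted sum of three reward components.

    Each score is assumed to lie in [0, 1], e.g. a classifier probability
    for the target emotion category, an intensity-match score in VAD space,
    and an emphasis-placement score for the synthesized utterance.
    """
    w_cat, w_int, w_emp = weights
    return (w_cat * category_score
            + w_int * intensity_score
            + w_emp * emphasis_score)
```

In practice, a policy-gradient method would maximize this reward over sampled synthesized utterances; the weighting between components is a design choice that trades off category accuracy against intensity and emphasis fidelity.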