Achieving precise and controllable emotional expression is crucial for producing natural, context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or on external guidance, which limits their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework built around an EmoSteer layer that learns a steering vector for each target emotion in the output embedding space, capturing that emotion's latent offset and maintaining stable, appropriate expression across utterances and emotion categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in both objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the effectiveness of the proposed EmoSteer layer and reveals its potential for controllable emotional intensity in speech synthesis.
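To make the mechanism concrete, the sketch below shows one plausible reading of an activation-steering layer of this kind: a learnable per-emotion vector added to the backbone's output embeddings. The class name `EmoSteerLayer`, the intensity scalar `alpha`, and the plain additive form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EmoSteerLayer(nn.Module):
    """Minimal sketch of an activation-steering layer (assumed design).

    One learnable steering vector per target emotion is added to the
    model's output embeddings, shifting activations toward that
    emotion's latent region while the frozen backbone stays untouched.
    """

    def __init__(self, hidden_dim: int, num_emotions: int):
        super().__init__()
        # Zero initialization so the untrained layer leaves the
        # backbone's original behavior intact.
        self.steering = nn.Parameter(torch.zeros(num_emotions, hidden_dim))

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor,
                alpha: float = 1.0) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); emotion_id: (batch,)
        shift = self.steering[emotion_id].unsqueeze(1)  # (batch, 1, hidden_dim)
        # alpha scales the learned offset, giving a knob that could
        # support the controllable emotional intensity noted above.
        return hidden + alpha * shift
```

Under this reading, only the steering vectors are trained, which is consistent with a parameter budget far below full fine-tuning, and varying `alpha` at inference time would trade off neutrality against emotional strength.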