Achieving precise and controllable emotional expression is crucial for producing natural, context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or on external guidance, which limits their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework built around an EmoSteer layer that learns a steering vector for each target emotion in the output embedding space, capturing that emotion's latent offset and maintaining stable, appropriate expression across utterances and emotion categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in both objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the effectiveness of the proposed EmoSteer layer and reveals its potential for controllable emotional intensity in speech synthesis.
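To make the mechanism concrete, the sketch below shows one plausible reading of an activation-steering layer of this kind: a learnable per-emotion vector added to the backbone's output embeddings. The class name `EmoSteerLayer`, the intensity scalar `alpha`, and the plain additive form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EmoSteerLayer(nn.Module):
    """Minimal sketch of an activation-steering layer (assumed design).

    One learnable steering vector per target emotion is added to the
    model's output embeddings, shifting activations toward that
    emotion's latent region while the frozen backbone stays untouched.
    """

    def __init__(self, hidden_dim: int, num_emotions: int):
        super().__init__()
        # Zero initialization so the untrained layer leaves the
        # backbone's original behavior intact.
        self.steering = nn.Parameter(torch.zeros(num_emotions, hidden_dim))

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor,
                alpha: float = 1.0) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); emotion_id: (batch,)
        shift = self.steering[emotion_id].unsqueeze(1)  # (batch, 1, hidden_dim)
        # alpha scales the learned offset, giving a knob that could
        # support the controllable emotional intensity noted above.
        return hidden + alpha * shift
```

Under this reading, only the steering vectors are trained, which is consistent with a parameter budget far below full fine-tuning, and varying `alpha` at inference time would trade off neutrality against emotional strength.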