Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from the linguistic content. In contrast, most expressive text-to-speech (TTS) systems enforce a single utterance-level emotion, which collapses affective diversity and suppresses mixed or text-emotion-misaligned expression. Activation steering via latent direction vectors offers a promising solution, but three questions remain open: whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotional behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models. It introduces a quantitative, controllable steering framework and multi-rater evaluation protocols that together enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results show, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module rather than the flow-matching module, and they yield a lightweight steering approach for generating natural, human-like emotional speech.
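As a rough illustration of what steering with latent direction vectors can look like, the sketch below builds one direction per emotion as a difference of mean activations and adds a weighted sum of directions to a block of hidden states. This is a minimal sketch under stated assumptions, not the framework evaluated in the paper: the difference-of-means construction, the function names `steering_vector` and `steer`, the toy dimensions, and the synthetic activations are all hypothetical choices made for the example.

```python
import numpy as np

def steering_vector(emotional_acts, neutral_acts):
    """Unit-length difference-of-means direction (an assumed construction)."""
    direction = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden, directions, weights):
    """Add a weighted sum of emotion directions to every time step."""
    offset = sum(w * d for w, d in zip(weights, directions))
    return hidden + offset

# Synthetic stand-ins for activations captured at one layer (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
neutral = rng.normal(size=(32, dim))
happy = neutral + 2.0 * np.eye(dim)[0]  # "happy" shifts along axis 0
sad = neutral + 2.0 * np.eye(dim)[1]    # "sad" shifts along axis 1

v_happy = steering_vector(happy, neutral)
v_sad = steering_vector(sad, neutral)

# Mixed-emotion steering: mostly happy, with a smaller sad component.
hidden = rng.normal(size=(10, dim))     # (time steps, hidden size)
mixed = steer(hidden, [v_happy, v_sad], [3.0, 1.0])
```

In a real hybrid TTS model the offset would be applied to intermediate activations of the chosen module during inference; the weights play the role of per-emotion steering strengths, which is what would make a mixture composable in this simplified picture.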