Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from the linguistic content. In contrast, most expressive text-to-speech (TTS) systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion-mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are synthesized primarily by the TTS language module rather than by the flow-matching module, and they also yield a lightweight steering approach for generating natural, human-like emotional speech.
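The activation-steering idea referenced above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's actual method: it assumes direction vectors obtained as the difference of mean activations between two emotion conditions, and a simple linear composition of several directions with per-emotion strengths to mimic mixed-emotion control. All function names and the choice of difference-of-means directions are assumptions for illustration.

```python
import numpy as np

def emotion_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit-normalized difference-of-means steering vector.

    pos_acts / neg_acts: (num_samples, hidden_dim) activations collected
    under two conditions (e.g. "happy" vs. "neutral" prompts).
    """
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, directions: list, alphas: list) -> np.ndarray:
    """Add a weighted sum of emotion directions to a hidden state.

    Composing several directions with different strengths (alphas) is
    one way to realize mixed-emotion steering in latent space.
    """
    out = hidden.copy()
    for d, a in zip(directions, alphas):
        out = out + a * d
    return out
```

In practice such a vector would be added to intermediate activations of the TTS language module (e.g. via a forward hook) during synthesis; the sketch only shows the vector arithmetic.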