Text-to-speech (TTS) has made great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted, detailed emotional text prompt, which makes fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach that achieves fine-grained speech emotion control (conversion, interpolation, and erasure) through activation steering. We first observe empirically that modifying a subset of the internal activations of a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we develop an efficient, training-free algorithm comprising activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming state-of-the-art (SOTA) methods. To the best of our knowledge, this is the first method to achieve training-free, continuous, fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.
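The general idea of activation steering can be illustrated with a minimal NumPy sketch. This is not the EmoSteer-TTS algorithm itself: it assumes a generic difference-of-means steering vector, a common choice in the activation-steering literature, and omits the emotional token searching step entirely. The function names (`steering_vector`, `steer`, `erase`), the array shapes, and the strength parameter `alpha` are all hypothetical, and random arrays stand in for activations that would in practice be extracted from a pretrained TTS model.

```python
import numpy as np

def steering_vector(emotional_acts, neutral_acts):
    """Unit-norm difference of mean activations (assumed construction).

    Both inputs have shape [num_samples, hidden_dim].
    """
    diff = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, vector, alpha):
    """Shift every token activation by alpha * vector.

    hidden has shape [num_tokens, hidden_dim]; varying alpha gives
    continuous control over how strongly the emotion direction is applied.
    """
    return hidden + alpha * vector

def erase(hidden, vector):
    """Remove each activation's component along the unit steering vector."""
    return hidden - np.outer(hidden @ vector, vector)

# Toy stand-ins for extracted activations (hypothetical data).
rng = np.random.default_rng(0)
neutral = rng.normal(size=(32, 8))
offset = np.zeros(8)
offset[0] = 2.0                      # pretend emotion lives in dimension 0
emotional = neutral + offset

v = steering_vector(emotional, neutral)
h = rng.normal(size=(5, 8))          # activations for a new utterance
h_steered = steer(h, v, alpha=0.5)   # push toward the emotion
h_erased = erase(h, v)               # project the emotion direction out
```

In this toy setting, varying `alpha` corresponds loosely to interpolation, swapping in a vector for a different emotion corresponds to conversion, and `erase` corresponds to erasure. How these operations are actually defined in EmoSteer-TTS is not specified in the abstract.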