EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Text-to-speech (TTS) has progressed rapidly in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically through discrete emotion labels or carefully crafted, detailed emotional text prompts, which makes fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach that achieves fine-grained speech emotion control (conversion, interpolation, erasure) through activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm comprising activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming state-of-the-art (SOTA) methods. To the best of our knowledge, this is the first method that achieves training-free, continuous, and fine-grained emotion control in TTS. Demo samples are available at https://emosteer-tts-demo.pages.dev/.
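The abstract names the three control modes (conversion, interpolation, erasure) but does not spell out the arithmetic. The sketch below illustrates the general idea of activation steering with a generic difference-of-means direction. It is an assumption for illustration, not the paper's algorithm: the function names, the synthetic activations, and the strength parameter `alpha` are all hypothetical, and the emotional token searching step is not modeled.

```python
import numpy as np


def steering_vector(emotional_acts, neutral_acts):
    """Unit-norm difference of mean activations (emotional minus neutral).

    Both inputs have shape (num_samples, hidden_dim).
    """
    diff = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(acts, direction, alpha):
    """Add alpha * direction to every token activation, shape (tokens, hidden_dim)."""
    return acts + alpha * direction


def erase(acts, direction):
    """Remove each activation's component along the unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)


def interpolate(dir_a, dir_b, weight):
    """Blend two unit directions; weight=0 gives dir_a, weight=1 gives dir_b.

    Assumes the blend is non-zero (dir_a and dir_b are not exact opposites).
    """
    mixed = (1.0 - weight) * dir_a + weight * dir_b
    return mixed / np.linalg.norm(mixed)


rng = np.random.default_rng(0)
hidden_dim = 8

# Synthetic stand-ins for activations collected from a pretrained model
# (hypothetical data; a real pipeline would record them from the network).
neutral = rng.normal(size=(32, hidden_dim))
happy = neutral + np.eye(hidden_dim)[0] * 2.0  # shifted along axis 0
sad = neutral + np.eye(hidden_dim)[1] * 2.0    # shifted along axis 1

v_happy = steering_vector(happy, neutral)
v_sad = steering_vector(sad, neutral)

tokens = rng.normal(size=(5, hidden_dim))
steered = steer(tokens, v_happy, alpha=1.5)      # conversion toward "happy"
erased = erase(steered, v_happy)                 # erasure of the "happy" direction
v_mix = interpolate(v_happy, v_sad, weight=0.5)  # halfway between two emotions
```

In this toy setup, `alpha` plays the role of a continuous intensity knob: larger values push activations further along the emotion direction, while `erase` removes that direction entirely. How a real system chooses which layers and tokens to modify is left open here.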


