TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset to enable LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains baseline-level speech quality of the underlying TTS model. Code and audio samples are available.
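The abstract names two mechanisms without giving formulas: segment-wise isolation of emotion conditioning under a causal mask, and global modulation of the EOS logit. The following is a minimal sketch of how such mechanisms could look, not the authors' implementation. The function names, the key layout (emotion-prompt tokens followed by decoding steps), the linear bias schedule, and the `strength` parameter are all assumptions made for illustration.

```python
from typing import List


def segment_causal_mask(step_segments: List[int],
                        prompt_segments: List[int]) -> List[List[bool]]:
    """Return mask[t][k]: may decoding step t attend to key k?

    Assumed key layout: [emotion-prompt tokens..., decoding steps...].
    step_segments[t] is the segment id of decoding step t;
    prompt_segments[p] is the segment id that emotion prompt token p belongs to.
    """
    mask = []
    for t, seg in enumerate(step_segments):
        # Emotion prompts are visible only to steps of the same segment,
        # which keeps each segment's emotion conditioning isolated.
        row = [p_seg == seg for p_seg in prompt_segments]
        # Previously generated steps keep ordinary causal visibility (k <= t).
        row += [k <= t for k in range(len(step_segments))]
        mask.append(row)
    return mask


def modulate_eos_logit(logits: List[float], eos_id: int, step: int,
                       target_steps: int, strength: float = 5.0) -> List[float]:
    """Return a copy of logits with a hypothetical linear bias on the EOS entry.

    The bias is negative before target_steps (discouraging early termination),
    zero at target_steps, and positive afterwards (encouraging termination).
    """
    out = list(logits)
    progress = (step - target_steps) / max(target_steps, 1)
    out[eos_id] += strength * progress
    return out


if __name__ == "__main__":
    # Three decoding steps (segments 0, 0, 1) and one emotion prompt per segment.
    for row in segment_causal_mask([0, 0, 1], [0, 1]):
        print(row)
    # EOS is token 1; the target length is 10 decoding steps.
    print(modulate_eos_logit([0.0, 1.0], eos_id=1, step=0, target_steps=10))
```

In this sketch the mask handles only emotion isolation, while the EOS bias is applied once per decoding step to the whole utterance, mirroring the local/global split described in the abstract.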

Related content

Speech synthesis

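Speech synthesis, also known as Text-to-Speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and largely meet the needs of certain application scenarios. Speech synthesis is now widely used in information announcement systems at banks and hospitals, in-car navigation systems, and automated call centers, yielding substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis is thus influencing many aspects of people's lives.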
