While controllable Text-to-Speech (TTS) has made notable progress, most existing methods are limited to inter-utterance-level control. Fine-grained intra-utterance expression remains challenging because these methods rely on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free controllable framework that enables intra-utterance emotion and duration expression in pretrained zero-shot TTS models. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Building on this, we further propose a segment-aware duration steering strategy that combines local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample text dataset annotated with multiple emotions and durations, which enables LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also preserves the baseline speech quality of the underlying TTS model. Code and audio samples are available.
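To make the idea of global EOS logit modulation concrete, the following is a minimal sketch and not the TED-TTS implementation. The abstract does not specify the modulation rule, so the linear bias schedule, the function name `modulate_eos_logit`, and the parameters `target_len` and `strength` are all assumptions made for illustration. The sketch suppresses the end-of-sequence (EOS) logit before an assumed target length is reached and boosts it afterwards, which is one simple way to steer termination during autoregressive decoding.

```python
import math


def modulate_eos_logit(logits, eos_id, step, target_len, strength=4.0):
    """Return a copy of `logits` with the EOS entry biased by decoding progress.

    Hypothetical rule (an assumption, not taken from the paper): the bias is
    linear in progress, negative before `target_len` and positive after it.
    """
    progress = step / max(target_len, 1)   # 0.0 at the start, 1.0 at the target length
    bias = strength * (progress - 1.0)     # < 0 before the target, > 0 after it
    out = list(logits)                     # copy so the caller's logits are untouched
    out[eos_id] += bias
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary; index 2 is treated as EOS by assumption.
logits = [1.0, 0.5, 0.0]
base = softmax(logits)
early = softmax(modulate_eos_logit(logits, eos_id=2, step=10, target_len=100))
late = softmax(modulate_eos_logit(logits, eos_id=2, step=150, target_len=100))
```

In this toy setting the EOS probability in `early` is lower than in `base`, and in `late` it is higher, so decoding is discouraged from stopping early and encouraged to stop once the assumed target length has passed. In practice, the target length would have to come from the segment-level duration plan, which this sketch does not model.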