While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, and their reliance on non-public datasets or complex multi-stage training makes fine-grained intra-utterance expression challenging. In this paper, we propose a training-free controllable framework for pretrained zero-shot TTS that enables intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Building on this, we further propose a segment-aware duration steering strategy that combines local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset that enables LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains the baseline speech quality of the underlying TTS model. Audio samples are available at https://aclanonymous111.github.io/TED-TTS-DemoPage/.