Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy that adapts a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model learns to stop early at specified content boundaries when provided with only limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show that our method outperforms a CosyVoice-style interleaved baseline in both short- and long-form scenarios. In long-text synthesis in particular, it achieves a 66.2-point absolute reduction in word error rate (from 71.0% to 4.8%) and relative improvements of 16.1% and 1.5% in speaker and emotion similarity, respectively, offering a robust solution for streaming TTS with incremental text.
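The sliding-window prompt described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `synthesize` callable, the `max_ctx` window size, and the per-chunk loop structure are all illustrative assumptions; the only idea taken from the abstract is that each new text chunk is synthesized with a bounded window of previous text and speech tokens carried forward as the prompt.

```python
from collections import deque

def build_prompt(history, new_text, max_ctx=8):
    """Form a bounded-context prompt from the most recent
    (text, speech) pairs plus the newly arrived text chunk."""
    window = list(history)[-max_ctx:]
    prompt_text = [t for t, _ in window] + [new_text]
    prompt_speech = [s for _, s in window]
    return prompt_text, prompt_speech

def stream_synthesize(text_chunks, synthesize, max_ctx=8):
    """Sliding-window streaming loop (illustrative): each incoming
    chunk is synthesized with bounded context, then the window
    advances so the prompt length never grows without bound."""
    history = deque(maxlen=max_ctx)  # oldest entries fall off automatically
    for chunk in text_chunks:
        prompt_text, prompt_speech = build_prompt(history, chunk, max_ctx)
        # A real model would decode speech tokens here and stop
        # early at the learned content boundary.
        speech = synthesize(prompt_text, prompt_speech)
        history.append((chunk, speech))
        yield speech
```

Because `deque(maxlen=max_ctx)` discards the oldest pairs, the prompt seen by the model is capped at `max_ctx + 1` text chunks regardless of how long the input stream runs, which is the property that prevents long-form context growth.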