Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text--speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM-based forced-alignment toolkits (e.g., MFA), which are pipeline-heavy and less flexible than neural aligners; fixed-ratio interleaving of text and speech tokens struggles to capture text--speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC-based aligner and introduces a bi-word-based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks. Speech samples are available at https://ctctts.github.io/.
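The distinction between the two variants can be sketched in a few lines of tensor manipulation. This is a minimal illustration, not the paper's implementation: the array names, shapes, and the 1:1 text--speech pairing are assumptions made purely to show how sequence-length concatenation differs from feature-dimension stacking.

```python
import numpy as np

# Hypothetical embeddings: 4 text tokens and 4 aligned speech tokens,
# each with feature dimension 8 (shapes chosen for illustration only).
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((4, 8))
speech_emb = rng.standard_normal((4, 8))

# CTC-TTS-L style: interleave along the sequence-length axis.
# The sequence doubles in length; the feature dimension is unchanged.
interleaved_L = np.empty((8, 8))
interleaved_L[0::2] = text_emb    # text token at even positions
interleaved_L[1::2] = speech_emb  # speech token at odd positions

# CTC-TTS-F style: stack along the feature axis.
# The sequence length is unchanged (fewer decoding steps, hence
# lower latency); the feature dimension doubles.
stacked_F = np.concatenate([text_emb, speech_emb], axis=1)

print(interleaved_L.shape)  # (8, 8)
print(stacked_F.shape)      # (4, 16)
```

The trade-off is visible in the shapes: the length-wise variant spends more sequence positions (and thus more model steps) per text--speech pair, while the feature-wise variant processes each pair in a single step at the cost of a wider input.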