Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence-length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens support high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle the speech modality within the context enables text-only guidance: a technique that blends logits from the text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
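The text-only guidance described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple linear interpolation between the two modes' next-token logits, with a guidance weight `alpha` (the function name, signature, and blending rule are all illustrative assumptions).

```python
import numpy as np

def blend_logits(logits_text_only: np.ndarray,
                 logits_text_speech: np.ndarray,
                 alpha: float = 0.5) -> np.ndarray:
    """Blend next-token logits from two decoding modes of the same LLM.

    alpha = 0.0 -> pure text-only mode (closest to text-LLM behavior)
    alpha = 1.0 -> pure text-speech mode (fully conditioned on speech)
    Intermediate values trade off between the two, which is one simple
    way to realize the logit blending the abstract describes.
    """
    return (1.0 - alpha) * logits_text_only + alpha * logits_text_speech
```

In practice the blended logits would replace the raw logits before softmax sampling at each decoding step; the exact blending rule and schedule for `alpha` are design choices of the actual system.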