Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
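The relationship between patch size and global KV-cache length can be illustrated with a small sketch. It is not the TLDR implementation: the function names, the padding id, and the assumption that text-prompt positions stay uncompressed are all hypothetical, introduced only to show one plausible reading of why a patch size of 4 yields a reduction of "up to" 75% (that is, 1 - 1/4 when only speech positions are counted).

```python
import math


def group_into_patches(tokens, patch_size, pad_id=0):
    """Split a flat list of codec token ids into fixed-size patches.

    The final patch is right-padded with pad_id, a hypothetical padding id
    chosen for this sketch.
    """
    if patch_size < 1:
        raise ValueError("patch_size must be >= 1")
    patches = []
    for start in range(0, len(tokens), patch_size):
        patch = list(tokens[start:start + patch_size])
        patch += [pad_id] * (patch_size - len(patch))
        patches.append(patch)
    return patches


def global_kv_reduction(num_speech_tokens, patch_size, num_text_tokens=0):
    """Fraction of global KV-cache positions saved by patch-level modeling.

    Assumption (not stated in the abstract): text-prompt positions are kept
    as-is, and only speech positions are compressed into patches.
    """
    num_patches = math.ceil(num_speech_tokens / patch_size)
    token_level = num_text_tokens + num_speech_tokens
    patch_level = num_text_tokens + num_patches
    if token_level == 0:
        return 0.0
    return 1.0 - patch_level / token_level


if __name__ == "__main__":
    print(group_into_patches([11, 12, 13, 14, 15, 16], patch_size=4))
    print(global_kv_reduction(1000, patch_size=4))
    print(global_kv_reduction(1000, patch_size=4, num_text_tokens=200))
```

Under these assumptions, the saving approaches 1 - 1/P for patch size P as the speech sequence dominates the context, and shrinks when a long uncompressed text prompt shares the cache.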