Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.
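The two local-transformer decoding strategies described above can be sketched in miniature. This is a minimal, hypothetical illustration, not the paper's implementation: `fake_logits` stands in for a real local-transformer forward pass, and `N_CODEBOOKS`, `VOCAB`, and the confidence-based commit schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CODEBOOKS = 4  # codebooks per timestep (hypothetical size)
VOCAB = 8        # entries per codebook (hypothetical size)

def fake_logits(decoded):
    """Stand-in for a local transformer forward pass: returns
    per-codebook logits, nominally conditioned on `decoded`."""
    return rng.normal(size=(N_CODEBOOKS, VOCAB))

def decode_autoregressive():
    """AR local transformer: decode codebooks one at a time,
    each prediction conditioned on the previously decoded ones."""
    codes = []
    for i in range(N_CODEBOOKS):
        logits = fake_logits(codes)[i]
        codes.append(int(np.argmax(logits)))
    return codes

def decode_maskgit(steps=2):
    """MaskGIT-style local transformer: start fully masked and
    commit the most confident positions over a few iterations."""
    codes = np.full(N_CODEBOOKS, -1)      # -1 marks a masked slot
    for step in range(steps):
        logits = fake_logits(codes)
        conf = logits.max(axis=1)
        conf[codes != -1] = -np.inf       # keep already-committed entries
        remaining = int((codes == -1).sum())
        k = max(1, remaining // (steps - step))  # commit k slots this round
        for i in np.argsort(conf)[::-1][:k]:
            codes[i] = int(np.argmax(logits[i]))
    return codes.tolist()
```

The AR variant needs `N` sequential forward passes per timestep, while the MaskGIT variant trades that for a small fixed number of iterations over all codebooks at once, which is the throughput/fidelity tradeoff the abstract analyzes.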