Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, so requiring an exact match lowers acceptance rates and limits speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest that acoustically aware, group-level acceptance is a simple and general way to accelerate speech token generation without degrading speech quality.
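The group-level verification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how a token's mass is split across its groups, so the code assumes an equal split (a token in m groups gives 1/m of its probability to each). The names `coarse_grain`, `verify_group_level`, and the 0/1 `membership` matrix are hypothetical, and the resampling rule on rejection is the standard speculative-sampling residual applied to groups, which is also an assumption here.

```python
import numpy as np


def coarse_grain(probs, membership):
    """Map a token distribution to a distribution over overlapping groups.

    membership is a (num_groups, vocab_size) 0/1 matrix. ASSUMPTION: each
    token splits its mass equally across the groups that contain it, so the
    group probabilities still sum to 1.
    """
    counts = membership.sum(axis=0)  # number of groups containing each token
    if np.any(counts == 0):
        raise ValueError("every token must belong to at least one group")
    return membership @ (probs / counts)


def verify_group_level(draft_token, p_target, q_draft, membership, rng):
    """Rejection-sample on the group variable (hypothetical sketch).

    draft_token is assumed to have been sampled from q_draft.
    Returns (token, group, accepted).
    """
    counts = membership.sum(axis=0)
    P = coarse_grain(p_target, membership)  # coarse-grained target
    Q = coarse_grain(q_draft, membership)   # coarse-grained draft

    # Picking uniformly among the draft token's groups makes the group a
    # sample from Q under the equal-split assumption.
    groups_of_token = np.flatnonzero(membership[:, draft_token])
    g = int(rng.choice(groups_of_token))

    # Q[g] > 0 because the draft token has nonzero draft probability.
    if rng.random() < min(1.0, P[g] / Q[g]):
        # Accepted: the draft token stands in for its group.
        return int(draft_token), g, True

    # Rejected: resample the group from the normalized residual max(P - Q, 0).
    residual = np.maximum(P - Q, 0.0)
    if residual.sum() <= 0.0:  # numerical guard when P and Q coincide
        residual = P.copy()
    residual = residual / residual.sum()
    g_new = int(rng.choice(len(residual), p=residual))

    # Pick a token inside the new group using the target's share of mass.
    within = membership[g_new] * p_target / counts
    within = within / within.sum()
    token = int(rng.choice(len(within), p=within))
    return token, g_new, False


if __name__ == "__main__":
    # Toy example: 5 tokens, 3 overlapping groups (tokens 1 and 3 are shared).
    M = np.array([[1, 1, 0, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 0, 1, 1]], dtype=float)
    p = np.array([0.1, 0.2, 0.3, 0.2, 0.2])  # target token distribution
    q = np.array([0.3, 0.1, 0.2, 0.3, 0.1])  # draft token distribution
    rng = np.random.default_rng(0)
    x = int(rng.choice(len(q), p=q))
    print(verify_group_level(x, p, q, M, rng))
```

In this sketch the exactness property holds for the returned group: over many draws of the draft token, the group follows the coarse-grained target distribution, while the emitted token on acceptance is simply the draft token. Other ways of splitting mass across overlapping groups would change `coarse_grain` but leave the accept/reject logic unchanged.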