Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This approach relies on a small draft model tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and an output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output-distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness whenever the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset at each decoding step. Across a variety of tasks, we demonstrate that SpecVocab achieves a higher acceptance length than the state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.