Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens that a large target model verifies once per speculation step. As vocabularies scale past 10^5 tokens, verification cost in the target model is largely unchanged, but the drafter can become bottlenecked by its O(|V|d) output projection. Recent approaches (e.g., FR-Spec, VocabTrim) mitigate this by restricting drafting to a fixed, frequency-ranked shortlist; however, such static truncation is corpus-dependent and suppresses rare or domain-specific tokens, reducing acceptance and limiting speedups. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism for large-vocabulary speculative decoding. DynaSpec trains lightweight meta-classifiers that route each context to a small set of coarse token clusters; the union of the top-selected clusters defines the drafter's shortlist, while the target model still verifies over the full vocabulary, preserving exactness. On the systems side, routing is overlapped with draft computation via parallel execution streams, reducing end-to-end overhead. Across standard speculative decoding benchmarks, DynaSpec consistently improves mean accepted length, recovering 98.4% of full-vocabulary performance for Llama-3-8B versus 93.6% for fixed-shortlist baselines, and achieves up to a 2.23x throughput gain, compared to 1.91x for static approaches, on the dataset with rare tokens.
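The core shortlisting step described above (route a context to its top-scoring coarse token clusters, then take the union of their token ids as the drafter's restricted vocabulary) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `dynamic_shortlist`, the toy cluster partition, and the hand-picked router scores are all hypothetical.

```python
def dynamic_shortlist(scores, cluster_to_tokens, top_k=3):
    """Select the top_k highest-scoring clusters and union their token ids.

    scores            -- per-cluster scores from the meta-classifier (router)
    cluster_to_tokens -- list mapping each cluster index to its token ids
    """
    top = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:top_k]
    # Union of the selected clusters defines the drafter's shortlist.
    return sorted({t for c in top for t in cluster_to_tokens[c]})

# Toy setup (hypothetical): a 100-token vocabulary partitioned into
# 10 coarse clusters of 10 contiguous token ids each.
cluster_to_tokens = [list(range(c * 10, (c + 1) * 10)) for c in range(10)]

# Stand-in for the meta-classifier's output on the current context.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.0, 0.4, 0.6]

shortlist = dynamic_shortlist(scores, cluster_to_tokens, top_k=3)
print(len(shortlist))  # 30 — clusters 1, 3, and 6 were selected
```

The drafter then computes logits only over the 30 shortlisted token ids, shrinking its output projection from O(|V|d) to O(|shortlist|·d), while the target model's verification still runs over all |V| tokens, so the output distribution is unchanged.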