Speculative Decoding (SD) has emerged as a widely used paradigm for accelerating the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens with a compact model and then verifying them in parallel with the target LLM. Notably, Self-Speculative Decoding constructs the draft model by skipping certain layers of the target model, eliminating the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping is highly sensitive to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different sets of skipped layers to inputs from various domains. We evaluated our algorithm across various models and multiple tasks, observing that it achieves a 1.3x-1.6x speedup in LLM inference.
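The core matching step can be sketched as a plain nearest-neighbor lookup: given a store of representative input embeddings, each paired with a skip-layer configuration known to draft well for that domain, a new input is routed to the configuration of its nearest stored neighbor. The embeddings, domains, and skip sets below are hypothetical placeholders, not the paper's actual data; this is a minimal illustration of the idea, assuming Euclidean distance over fixed-size embeddings.

```python
import numpy as np

# Hypothetical store: one embedding per representative domain input,
# each mapped to a layer-skip configuration assumed to work well there.
store_embeddings = np.array([
    [0.9, 0.1, 0.0],   # e.g. news-style inputs
    [0.1, 0.8, 0.1],   # e.g. code-style inputs
    [0.0, 0.2, 0.9],   # e.g. math-style inputs
])
store_skip_sets = [
    {3, 7, 11},        # layers to skip when drafting for domain 0
    {2, 5, 9},         # ... for domain 1
    {4, 8, 10},        # ... for domain 2
]

def select_skip_layers(query_emb: np.ndarray, k: int = 1) -> set:
    """Return the skip-layer set of the nearest stored embedding."""
    dists = np.linalg.norm(store_embeddings - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    # With k=1 this is a plain nearest-neighbor lookup; a larger k
    # could instead vote over the retrieved configurations.
    return store_skip_sets[nearest[0]]

# A query close to the "code-style" embedding picks that domain's config.
print(select_skip_layers(np.array([0.15, 0.75, 0.1])))
```

In a real system the store would be built offline by profiling which skipped layers preserve draft acceptance on each domain, and the query embedding would come from the target model itself.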