We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens more quickly, at slightly lower quality, by selectively skipping certain intermediate layers. Subsequently, the verification stage employs the original LLM to validate those draft tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM, thereby maintaining output quality. The proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its fine-tuned models demonstrate a speedup of up to 1.73$\times$.
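The draft-then-verify loop described above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the method's implementation: the "model" is a stack of integer functions (`LAYERS`), `next_token` stands in for a greedy forward pass, and the names `skip` and `k` (draft length) are hypothetical. Verification is written as a loop over draft positions; in a real transformer those positions would be scored together in a single forward pass. The sketch assumes greedy decoding only.

```python
from typing import AbstractSet, Callable, List, Sequence

VOCAB_SIZE = 50

# Toy "layers" (hypothetical): each maps an integer hidden state to a new one.
LAYERS: List[Callable[[int], int]] = [
    lambda h: h + 3,
    lambda h: h * 7 + 1,
    lambda h: h + (h % 2),
    lambda h: h * 3 + 2,
]


def next_token(prefix: Sequence[int], skip: AbstractSet[int] = frozenset()) -> int:
    """Greedy next token of the toy model; layers whose index is in `skip` are bypassed."""
    h = sum((i + 1) * t for i, t in enumerate(prefix))
    for idx, layer in enumerate(LAYERS):
        if idx in skip:
            continue  # drafting: skip this intermediate layer
        h = layer(h) % 1009
    return h % VOCAB_SIZE


def greedy_decode(prompt: Sequence[int], n_new: int) -> List[int]:
    """Baseline: ordinary token-by-token decoding with the full model."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def self_speculative_decode(
    prompt: Sequence[int], n_new: int, skip: AbstractSet[int], k: int = 4
) -> List[int]:
    """Draft up to k tokens with layers skipped, then verify them with the full model."""
    seq = list(prompt)
    target_len = len(seq) + n_new
    while len(seq) < target_len:
        # Drafting stage: cheaper model (same weights, fewer layers).
        draft: List[int] = []
        for _ in range(min(k, target_len - len(seq))):
            draft.append(next_token(seq + draft, skip))

        # Verification stage: compare each draft token with the full model's choice.
        # (A real LLM would obtain all of these predictions in one forward pass.)
        for i, tok in enumerate(draft):
            full_tok = next_token(seq + draft[:i])
            if full_tok != tok:
                # Keep the matching prefix, then substitute the full model's token.
                seq.extend(draft[:i])
                seq.append(full_tok)
                break
        else:
            seq.extend(draft)  # every draft token was accepted
    return seq
```

Calling `self_speculative_decode([1, 2, 3], 20, skip={2})` returns the same sequence as `greedy_decode([1, 2, 3], 20)`, because every token that is kept has been confirmed, or replaced, by the full model. In this toy setting the choice of skipped layers affects only how many draft tokens are accepted, not the final output.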