We present self-speculative decoding, a novel inference scheme for accelerating Large Language Models (LLMs) without the need for an auxiliary model. The approach consists of two stages: drafting and verification. The drafting stage generates draft tokens more quickly, at slightly lower quality, by selectively skipping certain intermediate layers. The verification stage then employs the original LLM to validate the draft tokens in a single forward pass. This process guarantees that the final output is identical to that produced by the unaltered LLM. Moreover, the proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its variants demonstrate speedups of up to 1.99$\times$.
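The draft-then-verify loop can be sketched as follows. This is a minimal toy illustration, not the paper's LLaMA implementation: `full_next` stands in for greedy decoding with the original LLM, and `draft_next` stands in for the same model with some intermediate layers skipped, so it is cheaper but occasionally disagrees. Both functions and all constants here are invented for the sketch.

```python
def full_next(context):
    # Toy stand-in for the full model: a deterministic next-token
    # rule over integer tokens (greedy decoding).
    return (sum(context) * 31 + len(context)) % 100

def draft_next(context):
    # Toy stand-in for the layer-skipping draft model: agrees with
    # the full model most of the time, mimicking the quality loss
    # from skipped layers.
    tok = full_next(context)
    return (tok + 1) % 100 if len(context) % 5 == 0 else tok

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    target = len(prompt) + n_tokens
    while len(out) < target:
        # Drafting stage: cheaply propose up to k draft tokens.
        draft = []
        for _ in range(min(k, target - len(out))):
            draft.append(draft_next(out + draft))
        # Verification stage: the full model checks every draft
        # position (in a real LLM this is one batched forward pass).
        for i in range(len(draft)):
            correct = full_next(out + draft[:i])
            if draft[i] != correct:
                # First mismatch: keep the full model's token instead
                # and discard the rest of the draft.
                out.extend(draft[:i] + [correct])
                break
        else:
            out.extend(draft)  # every draft token was accepted
    return out[:target]
```

Because every accepted token is exactly what the full model would have produced greedily, the output matches plain full-model decoding token for token; the speedup comes from accepting several cheap draft tokens per full-model pass.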