Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, covering all models in the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup of 2.7x-3.5x and doubled throughput, while maintaining the distribution of the generated text.
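The core mechanism described above, autoregression over second-to-top-layer features, with the token sequence advanced by one step supplied as an extra input to resolve sampling uncertainty, can be sketched as a toy in pure Python. Everything here is an illustrative assumption: the draft model is reduced to a single random linear map (EAGLE actually uses a small transformer decoder layer), the matrices `embed`, `lm_head`, and `W` are random stand-ins for trained weights, and greedy token selection replaces EAGLE's sampling plus verification against the base model.

```python
import random

random.seed(0)
D, V = 8, 16  # toy feature dimension and vocab size (illustrative only)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

embed = rand_matrix(V, D)    # token embedding table (assumed shared with the base LLM)
lm_head = rand_matrix(V, D)  # frozen LM head, reused from the base LLM
W = rand_matrix(D, 2 * D)    # draft model: one linear map standing in for EAGLE's layer

def draft_step(feature, next_token):
    """Predict the next second-to-top-layer feature from the current feature
    concatenated with the embedding of the token one time step ahead."""
    x = feature + embed[next_token]  # list concatenation: [f_i ; e(t_{i+1})]
    return matvec(W, x)

def draft_tokens(feature, last_token, k):
    """Autoregress at the feature level, emitting k draft tokens."""
    out = []
    tok = last_token
    for _ in range(k):
        feature = draft_step(feature, tok)        # feature-level autoregression
        logits = matvec(lm_head, feature)         # map drafted feature to logits
        tok = max(range(V), key=lambda t: logits[t])  # greedy; EAGLE samples + verifies
        out.append(tok)
    return out

f0 = [random.gauss(0, 1) for _ in range(D)]
drafted = draft_tokens(f0, last_token=3, k=4)
print(drafted)
```

The key point the sketch captures is that each draft step conditions on both the previous feature and the already-known next token, which is the extra information that removes the uncertainty in purely feature-level autoregression; in the real system, the drafted tokens are then verified in parallel by the base LLM so the output distribution is preserved.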