Auto-regressive decoding makes inference with Large Language Models (LLMs) time-consuming. We propose EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple framework for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE drafts auto-regressively at the more regular feature level (the second-to-top layer) and resolves the sampling uncertainty in next-feature prediction by also feeding in the tokens from one time step ahead. The acceleration is lossless: it requires no fine-tuning of the target LLM, and the generated text follows the same distribution as vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework in the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. With gpt-fast, EAGLE averages 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared with 24 tokens/s for Hugging Face's implementation.
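To make the losslessness claim concrete, the following is a minimal sketch of the standard speculative-sampling acceptance rule for a single drafted token, which is the mechanism that keeps the output distribution identical to that of the target model. It is not EAGLE's full verification procedure; the function names and the toy distributions `p` and `q` are hypothetical and chosen only for illustration.

```python
import random


def speculative_accept(p, q, drafted, rng):
    """Apply the standard speculative-sampling acceptance rule to one token.

    p: target-model probabilities over the vocabulary
    q: draft-model probabilities over the same vocabulary
    drafted: token index that was sampled from q
    Returns a token index distributed according to p.
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    # On rejection, resample from the residual max(p - q, 0);
    # rng.choices normalizes the weights internally.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def empirical_distribution(p, q, n_samples=100_000, seed=0):
    """Estimate the output distribution of draft-then-verify by simulation."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n_samples):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        counts[speculative_accept(p, q, drafted, rng)] += 1
    return [c / n_samples for c in counts]


# Toy distributions (made-up numbers, not taken from the paper).
p = [0.6, 0.3, 0.1]  # target model
q = [0.2, 0.5, 0.3]  # draft model
est = empirical_distribution(p, q)
print(est)
```

Even though the draft distribution `q` differs substantially from `p`, the empirical frequencies should approach `p`. A better draft model raises the acceptance rate and therefore the speedup, but under this rule it does not change the output distribution.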