As the use of large language models (LLMs) grows, efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets and to adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.73\times$, and $2.27\times$, respectively. In the offloading setting on an L40, Sequoia achieves a latency as low as 0.56 s/token for exact Llama2-70B inference, a $9.96\times$ speedup over our optimized offloading system (5.6 s/token), $9.7\times$ over DeepSpeed-Zero-Inference, and $19.5\times$ over Huggingface Accelerate.
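To make the dynamic programming idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a simplified acceptance model that the abstract does not specify: the $k$-th child of any node is accepted with a fixed probability `p[k]`, given that its parent was accepted, and these probabilities are non-increasing in $k$. The function name `best_tree_value` and the example probabilities are hypothetical. Under that assumption, the expected number of tokens produced per verification step by the best tree with at most $n$ nodes satisfies $c(n) = 1 + \max \sum_k p_k \, c(a_k)$ subject to $\sum_k a_k = n - 1$, where $a_k$ is the size of the subtree under the $k$-th child.

```python
def best_tree_value(p, n_max):
    """Return c, where c[n] is the best expected number of tokens per
    verification step achievable by a token tree with at most n nodes.

    Assumption (illustrative only): the k-th child of any node is accepted
    with probability p[k], independent of position in the tree, and p is
    non-increasing.
    """
    c = [0.0] * (n_max + 1)  # c[0] = 0: an empty subtree contributes nothing
    for n in range(1, n_max + 1):
        budget = n - 1  # nodes left to distribute among the root's children
        # g[m] = best total child contribution using at most m nodes,
        # over the child slots processed so far
        g = [0.0] * (budget + 1)
        for p_k in p:
            new_g = g[:]  # option: give this child slot no nodes
            for m in range(1, budget + 1):
                for a in range(1, m + 1):  # subtree size for this child
                    candidate = g[m - a] + p_k * c[a]
                    if candidate > new_g[m]:
                        new_g[m] = candidate
            g = new_g
        # The root counts as one token under this sketch's convention
        c[n] = 1.0 + g[budget]
    return c


if __name__ == "__main__":
    # Hypothetical acceptance probabilities for the first and second child
    values = best_tree_value([0.6, 0.3], 8)
    print([round(v, 3) for v in values])
```

With `p = [0.6, 0.3]` and three nodes, the sketch prefers a chain (value $1 + 0.6 + 0.36$) over a root with two children (value $1 + 0.6 + 0.3$). The growth of `c[n]` with `n` is the quantity a hardware-aware optimizer would weigh against the cost of verifying a larger tree.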