As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets and to adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.84\times$, and $2.37\times$, respectively, and speeds up Llama2-70B offloading on an L40 by up to $10.33\times$.
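The abstract describes the tree-construction step only at a high level. As a rough illustration of what a dynamic program over speculation-tree structures could look like, the sketch below assumes (my assumptions, not stated above) that the objective is the expected number of accepted tokens per verification step, and that the k-th ranked child of any node is accepted independently with probability p_k; under these assumptions the best tree on n nodes satisfies F(n) = 1 + max over allocations of the remaining n-1 nodes of sum_k p_k F(n_k). The names `P`, `compositions`, and `best_score`, and the probability values, are hypothetical.

```python
from functools import lru_cache

# Illustrative acceptance probabilities for the k-th ranked child of any
# node (hypothetical values; real rates would be measured from the draft
# model, not hard-coded).
P = (0.8, 0.5, 0.2)

def compositions(total, parts):
    """Yield every way to split `total` nodes among `parts` subtrees."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

@lru_cache(maxsize=None)
def best_score(n):
    """Max expected accepted tokens for a speculation tree with n nodes,
    under the assumed recurrence F(n) = 1 + max sum_k p_k * F(n_k)."""
    if n == 0:
        return 0.0
    best = 0.0
    # The root contributes one token; distribute the other n-1 nodes
    # among up to len(P) child subtrees and keep the best split.
    for alloc in compositions(n - 1, len(P)):
        best = max(best, 1.0 + sum(p * best_score(m) for p, m in zip(P, alloc)))
    return best

if __name__ == "__main__":
    for budget in (1, 4, 8, 16):
        print(f"budget={budget:2d}  expected accepted ~ {best_score(budget):.3f}")
```

Memoizing on the budget alone suffices here because, under the i.i.d.-acceptance assumption, subtrees of equal size have equal optimal value regardless of where they hang in the tree; this is only a sketch of the idea, not the paper's actual construction.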