SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

From arXiv, ASPLOS '24

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/
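
The token-tree idea in the abstract can be illustrated with a small sketch. This is not SpecInfer's implementation: the function names (`ancestor_mask`, `verify_greedy`), the parent-array tree encoding, and the toy `next_token` verifier are all assumptions made for illustration. The sketch shows two things: a mask that lets each tree node see only itself and its ancestors (one plausible way to score every candidate sequence in a single pass), and a greedy check that walks the tree and keeps the longest path agreeing with the verifier. For clarity the verifier is called once per step here, whereas a real system would evaluate all tree nodes in parallel.

```python
from typing import Callable, Dict, List, Optional, Sequence


def ancestor_mask(parents: Sequence[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True iff node j is node i itself or one of its ancestors.

    parents[i] is the index of node i's parent, or None when node i
    directly follows the already-verified prefix (hypothetical encoding).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j: Optional[int] = i
        while j is not None:
            mask[i][j] = True
            j = parents[j]
    return mask


def verify_greedy(
    prefix: Sequence[int],
    tokens: Sequence[int],
    parents: Sequence[Optional[int]],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens accepted under greedy verification.

    Starting from the prefix, repeatedly ask the verifier for its greedy
    next token and descend into the child that proposes the same token.
    The verifier's own token is always appended, so at least one token
    is produced even when every speculated token is rejected.
    """
    children: Dict[Optional[int], List[int]] = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    accepted: List[int] = []
    node: Optional[int] = None  # None stands for the verified prefix
    while True:
        target = next_token(list(prefix) + accepted)
        accepted.append(target)
        match = next(
            (c for c in children.get(node, []) if tokens[c] == target), None
        )
        if match is None:
            return accepted
        node = match
```

With a toy verifier that always predicts the previous token plus one, a tree whose best branch is 2, 3, 4 yields those three tokens plus one extra token from the verifier, so four tokens come out of a single verification round instead of one.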
