We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory, which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads, including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-output-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than model-based speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
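The suffix-tree mechanism the abstract describes can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: prior outputs are indexed in a frequency-counting suffix trie, and a draft sequence is speculated by matching the longest suffix of the recent context and greedily following the empirically most frequent continuations. All class and method names here are illustrative assumptions.

```python
from collections import defaultdict


class SuffixTrie:
    """Toy frequency-counting suffix trie over token sequences (illustrative only)."""

    def __init__(self):
        self.children = defaultdict(SuffixTrie)  # token -> child node
        self.count = 0  # how often the path to this node occurred

    def _insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children[t]
            node.count += 1

    def add_output(self, tokens, max_depth=8):
        # Index every suffix of a previously generated output,
        # truncated to max_depth to bound memory.
        for i in range(len(tokens)):
            self._insert(tokens[i:i + max_depth])

    def speculate(self, context, max_len=4):
        # Try progressively shorter suffixes of the recent context;
        # from the first full match, greedily extend along the
        # highest-count (most frequent) children.
        for start in range(len(context)):
            node, matched = self, True
            for t in context[start:]:
                if t not in node.children:
                    matched = False
                    break
                node = node.children[t]
            if matched and node.children:
                draft = []
                while node.children and len(draft) < max_len:
                    t, node = max(node.children.items(),
                                  key=lambda kv: kv[1].count)
                    draft.append(t)
                return draft
        return []  # no suffix of the context was seen before
```

For example, after `add_output("the cat sat on the mat".split())`, calling `speculate(["on", "the"])` returns `["mat"]`, because that continuation follows the matched suffix in the indexed output. A real system would also build a full speculation tree from multiple high-frequency branches rather than a single greedy path, and verify the draft tokens against the target model.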