We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory, which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads, including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-output-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than model-based speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
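The suffix-tree mechanism the abstract describes can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: prior outputs are indexed in a frequency-counting suffix trie, and a draft sequence is speculated by matching the longest suffix of the recent context and greedily following the empirically most frequent continuations. All class and method names here are illustrative assumptions.

```python
from collections import defaultdict


class SuffixTrie:
    """Toy frequency-counting suffix trie over token sequences (illustrative only)."""

    def __init__(self):
        self.children = defaultdict(SuffixTrie)  # token -> child node
        self.count = 0  # how often the path to this node occurred

    def _insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children[t]
            node.count += 1

    def add_output(self, tokens, max_depth=8):
        # Index every suffix of a previously generated output,
        # truncated to max_depth to bound memory.
        for i in range(len(tokens)):
            self._insert(tokens[i:i + max_depth])

    def speculate(self, context, max_len=4):
        # Try progressively shorter suffixes of the recent context;
        # from the first full match, greedily extend along the
        # highest-count (most frequent) children.
        for start in range(len(context)):
            node, matched = self, True
            for t in context[start:]:
                if t not in node.children:
                    matched = False
                    break
                node = node.children[t]
            if matched and node.children:
                draft = []
                while node.children and len(draft) < max_len:
                    t, node = max(node.children.items(),
                                  key=lambda kv: kv[1].count)
                    draft.append(t)
                return draft
        return []  # no suffix of the context was seen before
```

For example, after `add_output("the cat sat on the mat".split())`, calling `speculate(["on", "the"])` returns `["mat"]`, because that continuation follows the matched suffix in the indexed output. A real system would also build a full speculation tree from multiple high-frequency branches rather than a single greedy path, and verify the draft tokens against the target model.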