Large Language Models (LLMs) have revolutionized natural language processing by unifying tasks into text generation, yet their large parameter counts and autoregressive nature limit inference speed. SAM-Decoding addresses this by introducing a novel retrieval-based speculative decoding method that uses a suffix automaton for efficient and accurate draft generation. Unlike the n-gram matching used by existing methods, SAM-Decoding finds the longest suffix match between the generated text and a text corpus, achieving an average time complexity of $O(1)$ per generation step. SAM-Decoding constructs static and dynamic suffix automata for the text corpus and the input prompt, respectively, enabling fast and precise draft generation. Moreover, it is designed to be combined with existing methods, allowing SAM-Decoding to adaptively select a draft generation strategy based on the matching length, thus increasing the inference speed of the LLM. When combined with Token Recycling, evaluations show SAM-Decoding outperforms existing model-free methods, achieving a speedup of $2.27\times$ over autoregressive decoding on Spec-Bench. When combined with EAGLE2, it reaches a speedup of $2.49\times$, surpassing all current approaches. Our code is available at https://github.com/hyx1999/SAM-Decoding.
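The abstract's central mechanism, finding the longest suffix match of the generated text against a corpus in amortized $O(1)$ per token via a suffix automaton, can be sketched as follows. This is an illustrative reconstruction using the standard suffix-automaton construction and online matching algorithm, not the authors' implementation; the class names, the `endpos` field used for draft retrieval, and the `Matcher` wrapper are assumptions for the example.

```python
class State:
    """One state of the suffix automaton."""
    def __init__(self):
        self.next = {}    # outgoing transitions: token -> state index
        self.link = -1    # suffix link
        self.length = 0   # length of the longest substring in this state
        self.endpos = -1  # one end position in the corpus, for retrieval

class SuffixAutomaton:
    """Standard online suffix-automaton construction over a token sequence."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.states = [State()]
        self.last = 0
        for pos, t in enumerate(self.tokens):
            self._extend(t, pos)

    def _extend(self, t, pos):
        cur = len(self.states)
        self.states.append(State())
        self.states[cur].length = self.states[self.last].length + 1
        self.states[cur].endpos = pos
        p = self.last
        while p != -1 and t not in self.states[p].next:
            self.states[p].next[t] = cur
            p = self.states[p].link
        if p == -1:
            self.states[cur].link = 0
        else:
            q = self.states[p].next[t]
            if self.states[p].length + 1 == self.states[q].length:
                self.states[cur].link = q
            else:
                # Split q: clone keeps q's transitions but a shorter length.
                clone = len(self.states)
                self.states.append(State())
                self.states[clone].length = self.states[p].length + 1
                self.states[clone].next = dict(self.states[q].next)
                self.states[clone].link = self.states[q].link
                self.states[clone].endpos = self.states[q].endpos
                while p != -1 and self.states[p].next.get(t) == q:
                    self.states[p].next[t] = clone
                    p = self.states[p].link
                self.states[q].link = clone
                self.states[cur].link = clone
        self.last = cur

class Matcher:
    """Tracks the longest corpus suffix match of the text generated so far.

    Each step follows suffix links until a transition exists, which is
    amortized O(1) per generated token.
    """
    def __init__(self, sam):
        self.sam, self.state, self.match_len = sam, 0, 0

    def step(self, t):
        sts = self.sam.states
        while self.state != 0 and t not in sts[self.state].next:
            self.state = sts[self.state].link
            self.match_len = sts[self.state].length
        if t in sts[self.state].next:
            self.state = sts[self.state].next[t]
            self.match_len += 1
        else:
            self.match_len = 0  # no suffix of the text occurs in the corpus
        return self.match_len, sts[self.state].endpos
```

A draft is then retrieved as the corpus tokens following the matched position, e.g. `sam.tokens[endpos + 1 : endpos + 1 + draft_len]`; a static automaton over a fixed corpus and a dynamic one over the growing prompt/output would be maintained in the same way.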