The rapid growth in the parameters of large language models (LLMs) has made inference latency a fundamental bottleneck, limiting broader application of LLMs. Speculative decoding is a lossless approach that accelerates inference through a guess-and-verify paradigm, leveraging the parallel processing capabilities of modern hardware. Some speculative decoding methods rely on additional structures, such as small models or parameter-efficient architectures, to guess draft tokens; these require extra training before use. Alternatively, retrieval-based train-free techniques build retrieval libraries from pre-existing corpora or n-gram generation, but they face challenges such as large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during decoding are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm on the matrix to construct a draft tree, which is then validated through tree attention. New candidate tokens produced during decoding are then used to update the matrix. Token Recycling requires \textless2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It outperforms existing train-free methods by 30\% and even a training-based method by 25\%. It can be applied directly to any existing LLM and task without adaptation.
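The core data structure and drafting loop described above can be illustrated with a minimal sketch. This is an assumption-laden toy (small vocabulary, toy `update`/`build_draft_tree` helpers, and a naive full BFS expansion rather than a pruned tree template), not the paper's implementation:

```python
import numpy as np

VOCAB, TOPK = 1000, 3  # toy sizes; in practice VOCAB is the LLM vocabulary size

# Adjacency matrix: row t holds the top-k candidate successors
# most recently observed for token t during decoding.
matrix = np.zeros((VOCAB, TOPK), dtype=np.int64)

def update(tokens, topk_candidates):
    """Recycle candidates from a decoding step: for each token, store the
    top-k next-token candidates the model just produced for it."""
    for t, cands in zip(tokens, topk_candidates):
        matrix[t] = cands

def build_draft_tree(root, depth):
    """BFS-like expansion over the matrix: starting from the last decoded
    token, repeatedly append each frontier token's stored candidates."""
    tree = [root]
    frontier = [root]
    for _ in range(depth):
        nxt = []
        for t in frontier:
            for c in matrix[t]:
                tree.append(int(c))
                nxt.append(int(c))
        frontier = nxt
    return tree  # token ids in BFS order; verified jointly via tree attention

# Example: after token 5, the model's top-3 candidates were 7, 8, 9.
update([5], [[7, 8, 9]])
draft = build_draft_tree(5, depth=1)  # [5, 7, 8, 9]
```

The accepted prefix of the draft tree is kept after verification, and the top-k candidates computed during that same verification step feed back into `update`, so the matrix stays current at no extra model cost.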