LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Speculative decoding (SD), in which a small draft model proposes draft tokens in advance and the target model then validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD eliminate the draft model and instead generate draft tokens in a retrieval-based manner, which further reduces drafting overhead and makes deployment considerably easier. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec, which effectively expands the retrieval range and finds the most relevant reference to use as drafts. LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant references for both the next token and the next next token. LogitSpec is training-free and plug-and-play, so it can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec achieves up to 2.61$\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
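
The two drafting steps described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`top_k_ids`, `retrieve`, `draft_candidates`) and the parameters `k` and `span` are hypothetical, and the sketch assumes that step (1) takes the runner-up entries of the last logit as guesses for the next next token, which the abstract does not spell out. Retrieval here is a plain n-gram match against the existing context, and the target model's parallel verification is omitted.

```python
from typing import List, Sequence


def top_k_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def retrieve(context: Sequence[int], query: Sequence[int], span: int) -> List[int]:
    """Return up to `span` tokens that follow the most recent occurrence
    of `query` in `context`, or [] if there is no usable match."""
    n = len(query)
    for start in range(len(context) - n, -1, -1):
        if list(context[start:start + n]) == list(query):
            continuation = list(context[start + n:start + n + span])
            if continuation:  # skip a match that sits at the very end
                return continuation
    return []


def draft_candidates(
    context: Sequence[int],
    last_logits: Sequence[float],
    k: int = 2,
    span: int = 2,
) -> List[List[int]]:
    """Build draft sequences from the last logit and the context.

    Assumption (not stated in the abstract): the top-1 entry of the last
    logit is the next token, and the following k entries serve as guesses
    for the next next token.
    """
    ranked = top_k_ids(last_logits, k + 1)
    next_token, guesses = ranked[0], ranked[1:]

    drafts: List[List[int]] = []

    # Step (2), next token only: retrieve a continuation for the next token.
    continuation = retrieve(context, [next_token], span)
    if continuation:
        drafts.append([next_token] + continuation)

    # Steps (1) + (2): extend the query with each next-next-token guess,
    # which widens the set of n-grams the retrieval can match.
    for guess in guesses:
        continuation = retrieve(context, [next_token, guess], span)
        drafts.append([next_token, guess] + continuation)

    return drafts


if __name__ == "__main__":
    context = [5, 7, 9, 2, 5, 7, 9, 4]
    # Toy logits over a 10-token vocabulary: token 5 ranks first,
    # then token 7, then token 3.
    last_logits = [0.0, 0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 2.0, 0.0, 0.0]
    for draft in draft_candidates(context, last_logits):
        print(draft)
```

In a full speculative decoding loop, the candidate drafts would be passed to the target model, which verifies them in parallel and keeps the longest accepted prefix; how LogitSpec organizes and prunes its candidates is described in the paper and repository, not in this sketch.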
