We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. REST is motivated by the observation that text generation often includes common phases and patterns. Unlike previous speculative decoding methods, which rely on a draft language model, REST uses retrieval to generate draft tokens: it draws on existing knowledge, retrieving and employing relevant tokens based on the current context. Because it is plug-and-play, REST can be integrated with any language model and accelerate it without additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a speedup of 1.62× to 2.36× on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST.
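The retrieve-then-verify idea described above can be illustrated with a small toy sketch. This is not the REST implementation: the abstract does not specify the datastore layout, the matching rule, or the verification procedure, so everything below (the n-gram dictionary, longest-suffix lookup, greedy verification, and the names `build_datastore`, `retrieve_draft`, `verify`, `generate`, and `toy_model`) is an assumption made for illustration. The "model" is a stand-in function that replays a fixed token sequence.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Key = Tuple[str, ...]
Store = Dict[Key, List[Key]]


def build_datastore(corpus: Sequence[Sequence[str]],
                    max_suffix: int = 3, draft_len: int = 4) -> Store:
    """Map each context n-gram (1..max_suffix tokens) to the tokens that followed it."""
    store: Store = defaultdict(list)
    for seq in corpus:
        for i in range(1, len(seq)):
            continuation = tuple(seq[i:i + draft_len])
            for n in range(1, max_suffix + 1):
                if i - n < 0:
                    break
                store[tuple(seq[i - n:i])].append(continuation)
    return store


def retrieve_draft(store: Store, context: Sequence[str],
                   max_suffix: int = 3) -> List[str]:
    """Return the most frequent continuation of the longest matching context suffix."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in store:
            return list(Counter(store[key]).most_common(1)[0][0])
    return []  # nothing retrieved: fall back to ordinary one-token decoding


def verify(next_token: Callable[[Sequence[str]], str],
           context: Sequence[str], draft: Sequence[str]) -> List[str]:
    """Keep the draft prefix that matches greedy decoding, plus one model token.

    A real system would score all draft positions in one forward pass;
    this loop only simulates that check sequentially.
    """
    ctx, accepted = list(context), []
    for tok in draft:
        target = next_token(ctx)
        if tok != target:
            accepted.append(target)  # first mismatch: take the model's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(next_token(ctx))  # whole draft accepted: one bonus token
    return accepted


def generate(next_token: Callable[[Sequence[str]], str], store: Store,
             prompt: Sequence[str], max_new_tokens: int) -> Tuple[List[str], int]:
    """Decode with retrieved drafts; return the tokens and the number of verify steps."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(verify(next_token, out, retrieve_draft(store, out)))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Toy stand-in for a language model: greedily replays a fixed sequence.
TARGET = "for i in range ( n ) : print ( i )".split()


def toy_model(ctx: Sequence[str]) -> str:
    return TARGET[len(ctx)] if len(ctx) < len(TARGET) else "<eos>"


corpus = ["for j in range ( n ) : total += j".split(), "print ( i )".split()]
store = build_datastore(corpus)
tokens, steps = generate(toy_model, store, prompt=["for", "i"], max_new_tokens=10)
print(" ".join(tokens), "| verify steps:", steps)
```

In this sketch the output is identical to what plain greedy decoding would produce, because every retrieved token is checked against the model before it is kept. Any speedup comes only from needing fewer verification steps than generated tokens, which depends on how often the retrieved drafts match.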