Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it often lies within their top-$k$ predictions for small $k$. Building on this, we show that combinations of simple strategies achieve significant inference speedups across different tasks. The overall performance is comparable to that of more complex methods, yet requires no expensive preprocessing or modification of the base model, and allows for seamless `plug-and-play' integration into pipelines.
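As a rough illustration of the context-based strategy described above, the following minimal sketch (our own simplification, not the paper's implementation) builds an $N$-gram table over the tokens already in the context and uses it to propose a short draft continuation, which a base model would then verify in parallel:

```python
from collections import defaultdict

def build_ngram_table(tokens, n=3):
    """Map each (n-1)-token prefix seen in the context to the
    tokens that followed it (the draft candidates)."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix].append(tokens[i + n - 1])
    return table

def draft_from_context(tokens, n=3, k=4):
    """Propose up to k draft tokens by repeatedly extending the
    sequence with continuations observed in the context.
    Learning-free: no extra model, negligible cost."""
    table = build_ngram_table(tokens, n)
    cur, draft = list(tokens), []
    for _ in range(k):
        prefix = tuple(cur[-(n - 1):])
        if prefix not in table:
            break  # no match in context: fall back to normal decoding
        nxt = table[prefix][-1]  # take the most recent continuation
        draft.append(nxt)
        cur.append(nxt)
    return draft

# Toy example: the context repeats "the cat", so the table can draft
# the continuation "sat on the cat" from the trailing bigram.
print(draft_from_context(
    ["the", "cat", "sat", "on", "the", "cat"], n=3, k=4))
```

In a full speculative-decoding loop, the base model would score all $k$ drafted tokens in a single forward pass and keep the longest verified prefix; the function and parameter names here are illustrative only.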