Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference: a fast draft model predicts the upcoming tokens of a slower target model, which then verifies them in parallel with a single forward pass. However, speculative decoding introduces its own sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While verification of the current draft is ongoing, the draft model predicts likely verification outcomes and preemptively prepares a speculation for each. If the actual verification outcome falls in the predicted set, the corresponding speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding and propose principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
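The core SSD control flow described above can be illustrated with a minimal sketch. This is not the paper's implementation: the draft model is a deterministic stub, and all names (`draft_tokens`, `speculative_speculative_step`) are hypothetical. It shows only the scheduling idea of preparing speculations for predicted verification outcomes and returning one immediately on a hit.

```python
def draft_tokens(prefix, k=4):
    # Hypothetical stand-in for the fast draft model: deterministically
    # "proposes" k next tokens given the current token prefix.
    return [hash((tuple(prefix), i)) % 100 for i in range(k)]

def speculative_speculative_step(prefix, predicted_outcomes):
    # While the target model verifies the current draft (not modeled here),
    # the draft model prepares one speculation per predicted verification
    # outcome, in advance.
    cache = {
        outcome: draft_tokens(list(prefix) + list(outcome))
        for outcome in predicted_outcomes
    }

    def on_verification(actual_outcome):
        # If the actual outcome was anticipated, its pre-drafted speculation
        # is returned immediately (drafting overhead hidden by verification);
        # otherwise we fall back to drafting from scratch.
        if actual_outcome in cache:
            return cache[actual_outcome], True
        return draft_tokens(list(prefix) + list(actual_outcome)), False

    return on_verification
```

On a hit, the next speculation is available the moment verification completes, which is what removes drafting from the critical path.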