Speculative decoding is an effective method for lossless acceleration of large language model inference. It uses a fast model to draft a block of tokens, which the target model then verifies in parallel, and it guarantees that the output is distributed identically to a sample from the target model. In prior work, draft verification is performed independently, token by token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced per iteration and, in particular, is never worse than standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups of 5%-8% over the standard token verification algorithm across a range of tasks and datasets. Because block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, and cannot degrade performance but in fact consistently improves it, it can serve as a good default in speculative decoding implementations.
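For orientation, the token-by-token baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the widely used accept/reject rule for speculative sampling, not the block verification algorithm itself. The function name `token_verify` and the array names `p_probs` (target model distributions) and `q_probs` (draft model distributions) are assumptions introduced here for illustration.

```python
import numpy as np

def token_verify(draft_tokens, q_probs, p_probs, rng):
    """Standard token-level draft verification (illustrative sketch).

    draft_tokens: gamma token ids sampled from the draft model.
    q_probs: array of shape (gamma, V), draft distribution at each position.
    p_probs: array of shape (gamma + 1, V), target distribution at each
             position, including one extra position after the block.
    Returns the accepted prefix plus exactly one additional token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(int(x))
            continue
        # On rejection, resample from the normalized residual max(p - q, 0)
        # and stop; later draft tokens are discarded.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # Every draft token was accepted: sample one bonus token from the target.
    out.append(int(rng.choice(p_probs.shape[1], p=p_probs[-1])))
    return out
```

Each position is decided independently of the others here, which is the behavior the abstract identifies as suboptimal. Block verification replaces this per-token decision with a joint decision over the whole block while keeping the same output distribution.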