Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens and the next larger model verifies them in a single forward pass, until finally the target model verifies the tokens. We derive an expression for the expected latency of any such hierarchy and show that the latency-optimal hierarchy can be selected in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
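The hierarchical verification loop described above can be illustrated with a toy sketch. This is not the paper's algorithm: real HSD verifies draft tokens against the larger model's distribution with the standard speculative-sampling acceptance rule, whereas the sketch below uses a simplified greedy-acceptance variant (accept a drafted token only if it matches the verifier's greedy choice, then let the verifier contribute one corrected token). The helpers `make_model` and `hsd_generate` are hypothetical names, and the "models" are deterministic stand-ins rather than neural LMs.

```python
import random

def make_model(vocab_size, seed):
    """Hypothetical stand-in for an LM: deterministically maps a context
    to a greedy next token. Larger seed does NOT mean larger model here;
    this is purely for illustrating control flow."""
    rng = random.Random(seed)
    table = {}
    def model(ctx):
        key = tuple(ctx[-2:])            # condition on the last two tokens
        if key not in table:
            table[key] = rng.randrange(vocab_size)
        return table[key]
    return model

def hsd_generate(models, ctx, n_tokens, k=4):
    """Greedy-acceptance sketch of hierarchical speculative decoding.
    models[0] is the smallest draft model, models[-1] the target.
    The smallest model drafts k tokens; each larger model keeps the
    longest prefix matching its own greedy choices and appends one
    corrected token, so the surviving block is passed up the hierarchy."""
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        # Level 0: draft k tokens greedily.
        draft, tmp = [], list(out)
        for _ in range(k):
            t = models[0](tmp)
            draft.append(t)
            tmp.append(t)
        # Each larger model verifies the surviving block in turn.
        for verifier in models[1:]:
            kept, tmp = [], list(out)
            for t in draft:
                if verifier(tmp) == t:   # token agrees with verifier's greedy choice
                    kept.append(t)
                    tmp.append(t)
                else:
                    break                # first mismatch ends acceptance
            kept.append(verifier(tmp))   # verifier contributes one token itself
            draft = kept
        out.extend(draft)
    return out[len(ctx):len(ctx) + n_tokens]
```

Because the final verifier is the target model, every emitted token is the target's greedy choice in context, so this variant reproduces target-only greedy decoding exactly while the smaller models merely accelerate it, mirroring the lossless-quality property claimed in the abstract.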