To combat the memory-bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. In speculative decoding, a small draft model proposes candidate continuations of the input sequence, which are then verified in parallel by the base model. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate each token in the candidate continuation without conditioning on the preceding tokens in that continuation. In this work, we propose Hydra heads, a sequentially dependent, drop-in replacement for standard draft heads that significantly improves speculation accuracy. Decoding with Hydra heads improves throughput compared to Medusa decoding with standard draft heads. We further explore the design space of Hydra head training objectives and architectures, and propose a carefully tuned Hydra head recipe, which we call Hydra++, that improves decoding throughput by 1.31x and 2.71x compared to Medusa decoding and autoregressive decoding, respectively. Overall, Hydra heads are a simple intervention on standard draft heads that significantly improves the end-to-end speed of draft-head-based speculative decoding.
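As a rough illustration of the distinction drawn above, the following toy Python sketch contrasts a sequentially independent drafter with a sequentially dependent one inside a generic draft-then-verify loop with greedy acceptance. Everything in it is an assumption made for illustration: `base_next` is a hypothetical deterministic stand-in for the base model, both drafters are hand-written rules rather than trained heads, and the loop omits details of Medusa-style decoding such as tree-structured candidates. The only structural point it is meant to show is that a dependent head receives the tokens drafted before it, while an independent head does not.

```python
from typing import Callable, List, Tuple

VOCAB = 11  # toy vocabulary size (assumption)


def base_next(seq: List[int]) -> int:
    """Hypothetical stand-in for the base model's greedy next token."""
    return (seq[-1] ** 2 + seq[-2] + 1) % VOCAB


def autoregressive_decode(prompt: List[int], n_new: int) -> List[int]:
    """Reference decoding: one base-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(base_next(seq))
    return seq


def independent_heads(context: List[int], k: int) -> List[int]:
    """Sequentially independent drafter (toy): head i sees only the context,
    never the tokens drafted by heads 0..i-1."""
    c1, c2 = context[-1], context[-2]
    return [(c1 ** 2 + c2 + 1 + i * c1) % VOCAB for i in range(k)]


def dependent_heads(context: List[int], k: int) -> List[int]:
    """Sequentially dependent drafter (toy): head i also sees the tokens
    drafted by earlier heads. It is deliberately imperfect."""
    draft: List[int] = []
    for _ in range(k):
        seq = context + draft
        if seq[-1] % 3:
            guess = base_next(seq)          # toy head happens to be right here
        else:
            guess = (seq[-1] + 1) % VOCAB   # toy inaccuracy
        draft.append(guess)
    return draft


Drafter = Callable[[List[int], int], List[int]]


def speculative_decode(
    prompt: List[int], n_new: int, drafter: Drafter, k: int = 4
) -> Tuple[List[int], int]:
    """Draft k tokens, keep the longest prefix the base model agrees with,
    then append one base-model token. Returns (sequence, verification passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        draft = drafter(seq, k)
        passes += 1  # a real base model would score all k positions in one pass
        accepted: List[int] = []
        for tok in draft:
            target = base_next(seq + accepted)
            if tok != target:
                accepted.append(target)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            accepted.append(base_next(seq + accepted))  # bonus token
        seq.extend(accepted)
    return seq[: len(prompt) + n_new], passes


if __name__ == "__main__":
    prompt = [1, 2]
    for name, drafter in [("independent", independent_heads),
                          ("dependent", dependent_heads)]:
        out, passes = speculative_decode(prompt, 12, drafter)
        print(name, "passes:", passes,
              "matches greedy:", out == autoregressive_decode(prompt, 12))
```

Because every appended token equals the base model's own greedy choice, the output matches plain autoregressive decoding regardless of which drafter is used; drafter quality only changes how many verification passes are needed. Any difference in pass counts between the two toy drafters reflects how they were constructed here and says nothing about the accuracy of trained heads.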