To combat the memory-bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. In speculative decoding, a small draft model proposes candidate continuations of the input sequence, which the base model then verifies in parallel. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate each token in the candidate continuation independently of the tokens that precede it in that continuation. In this work, we propose Hydra heads: a sequentially dependent, drop-in replacement for standard draft heads that significantly improves the accuracy of draft head speculation. We further explore the design space of Hydra head training objectives and architectures, and propose a carefully tuned Hydra head recipe, which we call Hydra++, that improves decoding throughput by up to 1.31x and 2.70x compared to Medusa decoding and autoregressive decoding, respectively. Overall, Hydra heads are a simple and well-motivated intervention on standard draft heads that significantly improves the end-to-end speed of draft head-based speculative decoding. We make our code publicly available at https://github.com/zankner/Hydra.
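The difference between sequentially independent and sequentially dependent draft heads can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and is not taken from the Hydra codebase: the integer "tokens", the `step` transition rule, the use of the last verified token as the "hidden state", and the deliberately contrived toy heads. The sketch also simplifies verification to a single candidate with greedy acceptance, and it loops sequentially where a real base model would check all positions in one parallel forward pass. It shows only the data flow, and says nothing about the accuracy of real draft heads.

```python
from typing import Callable, List, Sequence

Token = int
VOCAB_SIZE = 7  # toy vocabulary size (assumption for illustration)


def step(token: Token) -> Token:
    """Toy one-step transition rule shared by the base model and the heads."""
    return (3 * token + 1) % VOCAB_SIZE


def base_next_token(context: Sequence[Token]) -> Token:
    """Stand-in for the base model's greedy next-token choice."""
    return step(context[-1])


def draft_independent(
    hidden: Token, heads: Sequence[Callable[[Token], Token]]
) -> List[Token]:
    """Sequentially independent drafting: each head sees only the hidden state."""
    return [head(hidden) for head in heads]


def draft_dependent(
    hidden: Token, heads: Sequence[Callable[[Token, Sequence[Token]], Token]]
) -> List[Token]:
    """Sequentially dependent drafting: each head also sees the tokens drafted so far."""
    drafted: List[Token] = []
    for head in heads:
        drafted.append(head(hidden, tuple(drafted)))
    return drafted


def verify(context: Sequence[Token], candidate: Sequence[Token]) -> List[Token]:
    """Accept the longest candidate prefix that matches the base model's greedy
    output, then append one token from the base model itself.

    A real system scores all candidate positions in one parallel forward pass;
    the loop here is sequential only for clarity.
    """
    accepted: List[Token] = []
    for token in candidate:
        if token != base_next_token(list(context) + accepted):
            break
        accepted.append(token)
    accepted.append(base_next_token(list(context) + accepted))
    return accepted


# Deliberately contrived toy heads: each can apply the one-step rule exactly once.
def toy_independent_head(hidden: Token) -> Token:
    return step(hidden)


def toy_dependent_head(hidden: Token, drafted: Sequence[Token]) -> Token:
    return step(drafted[-1] if drafted else hidden)


context = [1, 2]
hidden = context[-1]  # toy "hidden state": just the last verified token

independent = draft_independent(hidden, [toy_independent_head] * 3)
dependent = draft_dependent(hidden, [toy_dependent_head] * 3)

print(verify(context, independent))  # [0, 1]
print(verify(context, dependent))    # [0, 1, 4, 6]
```

In this toy setting the dependent heads can build on the tokens drafted before them, so more of the candidate continuation survives verification and each verification step yields more tokens. The size of the gap here is an artifact of how the toy heads were constructed.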