In this paper, we introduce an improved approach to speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, it distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small draft model used in classic speculative decoding, but without the complexities of the full transformer architecture. Thanks to the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The result is a method that retains the simplicity of the single-model design while avoiding the need to construct, solely for inference, the data-dependent tree attention structure used in Medusa. We empirically demonstrate the effectiveness of the proposed method on several popular open-source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
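As a rough illustration of the idea, the following is a minimal sketch, not the paper's implementation, of a recurrent draft head driven by beam search. Identifiers such as `RecurrentDraftHead` and `draft_beam_search`, as well as the choice of a GRU-cell update, are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of a recurrent draft head with beam search over draft tokens.
# Names, dimensions, and the GRU-cell recurrence are assumptions for illustration only.
import torch
import torch.nn as nn


class RecurrentDraftHead(nn.Module):
    """Lightweight head that drafts future tokens from the base model's last
    hidden state, conditioning each step on the previously drafted token
    (the recurrent dependency)."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # Mix the running draft state with the embedding of the last drafted token.
        self.rnn_cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def step(self, state: torch.Tensor, prev_token: torch.Tensor):
        """One draft step: update the recurrent state with the previous token
        and return (next_state, next_token_logits)."""
        state = self.rnn_cell(self.embed(prev_token), state)
        return state, self.lm_head(state)


def draft_beam_search(head: RecurrentDraftHead, hidden: torch.Tensor,
                      last_token: torch.Tensor, steps: int, beam: int):
    """Beam search over the draft head to keep only the top-`beam` candidate
    continuations (batch size 1 for brevity)."""
    states = hidden.expand(beam, -1).contiguous()   # (beam, hidden_dim)
    tokens = last_token.expand(beam)                # (beam,)
    scores = torch.zeros(beam)
    seqs = [[] for _ in range(beam)]
    for t in range(steps):
        states, logits = head.step(states, tokens)
        log_probs = torch.log_softmax(logits, dim=-1)   # (beam, vocab)
        cand = scores.unsqueeze(1) + log_probs
        if t == 0:
            cand = cand[0:1]                            # all beams are identical at the first step
        flat = cand.reshape(-1)
        scores, idx = flat.topk(beam)
        beam_idx = idx // log_probs.size(-1)            # which beam each candidate extends
        tokens = idx % log_probs.size(-1)               # the newly drafted token ids
        states = states[beam_idx]
        seqs = [seqs[b] + [tokens[i].item()] for i, b in enumerate(beam_idx.tolist())]
    return seqs, scores


if __name__ == "__main__":
    head = RecurrentDraftHead(hidden_dim=64, vocab_size=100)
    hidden = torch.randn(1, 64)      # base model's last hidden state (placeholder)
    last_token = torch.tensor([5])   # last verified token id (placeholder)
    candidates, cand_scores = draft_beam_search(head, hidden, last_token, steps=4, beam=3)
    print(candidates, cand_scores)
```

In a full speculative-decoding loop, the surviving candidates would then be verified in a single forward pass of the base model, with accepted tokens appended to the output and the rest discarded.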