In this paper, we introduce an improved approach to speculative decoding aimed at making large language model serving more efficient. Our method builds on the strengths of two established techniques: the classic two-model speculative decoding approach and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. It differs, however, in employing a single lightweight draft head with a recurrent dependency design, akin in essence to the small draft model used in classic speculative decoding but without the complexity of a full transformer architecture. Because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The result is a method that retains the simplicity of a single-model design while avoiding the need, as in Medusa, to create a data-dependent tree attention structure solely for inference. We empirically demonstrate the effectiveness of the proposed method on several popular open-source language models and provide a comprehensive analysis of the trade-offs involved in adopting this approach.
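To make the idea of a recurrent draft head combined with beam search concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the architecture or training setup of the proposed method: the class `ToyRecurrentDraftHead`, the function `draft_with_beam_search`, the single `tanh` recurrence, and the random weights are all hypothetical. The sketch covers only the drafting stage; verification of the candidates by the target language model is omitted.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


class ToyRecurrentDraftHead:
    """Hypothetical one-layer recurrent head with random weights (illustration only)."""

    def __init__(self, hidden, vocab, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(scale=0.1, size=(vocab, hidden))
        self.w_state = rng.normal(scale=0.1, size=(hidden, hidden))
        self.w_token = rng.normal(scale=0.1, size=(hidden, hidden))
        self.w_out = rng.normal(scale=0.1, size=(hidden, vocab))

    def step(self, state, token):
        # Recurrent dependency: the new state depends on the previous state
        # and on the token that was drafted at the previous step.
        new_state = np.tanh(state @ self.w_state + self.embed[token] @ self.w_token)
        return new_state, log_softmax(new_state @ self.w_out)


def draft_with_beam_search(head, llm_hidden, last_token, steps, beam_width):
    """Draft `steps` tokens, keeping the `beam_width` best partial sequences."""
    states = llm_hidden[None, :]               # (beams, hidden), starts with one beam
    tokens = np.array([last_token])            # last token of each beam
    seqs = np.empty((1, 0), dtype=np.int64)    # drafted tokens per beam
    scores = np.zeros(1)                       # cumulative log-probabilities
    for _ in range(steps):
        states, logp = head.step(states, tokens)       # logp: (beams, vocab)
        total = scores[:, None] + logp                  # score of every extension
        flat = total.ravel()
        k = min(beam_width, flat.size)
        top = np.argsort(flat)[::-1][:k]                # best k extensions, descending
        beam_idx, tok_idx = np.unravel_index(top, total.shape)
        seqs = np.concatenate([seqs[beam_idx], tok_idx[:, None]], axis=1)
        states = states[beam_idx]                       # carry the surviving states
        tokens = tok_idx
        scores = flat[top]
    return seqs, scores


head = ToyRecurrentDraftHead(hidden=16, vocab=50)
rng = np.random.default_rng(1)
candidates, scores = draft_with_beam_search(
    head, rng.normal(size=16), last_token=3, steps=4, beam_width=5
)
print(candidates.shape)  # (5, 4)
```

Because each drafting step conditions on the previously drafted token, low-scoring continuations can be pruned step by step, which is what allows beam search to discard unpromising candidates before the target model is asked to verify them.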