In this paper, we introduce an improved approach to speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small draft model used in classic speculative decoding, but without the complexities of the full transformer architecture. Because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The result is a method that retains the simplicity of a single-model design while avoiding the data-dependent tree attention structure that Medusa constructs solely for inference. We empirically demonstrate the effectiveness of the proposed method on several popular open-source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
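To make the role of the recurrent dependency concrete, the following is a minimal sketch of how a recurrent draft head could be combined with beam search to propose candidate continuations. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the class `ToyRecurrentDraftHead`, the function `beam_search_draft`, the tanh recurrence, the random weights, and all sizes are hypothetical, and the verification of candidates by the target language model is omitted.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class ToyRecurrentDraftHead:
    """Hypothetical stand-in for a lightweight recurrent draft head.

    The weights are random; a real head would be trained and would start
    from the hidden state of the target model.
    """

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.embed = rand(vocab_size, hidden_size)   # token embeddings
        self.w_rec = rand(hidden_size, hidden_size)  # recurrent weights
        self.w_out = rand(vocab_size, hidden_size)   # output projection

    def step(self, state, token):
        """Fold the previous token into the state; return (new_state, log_probs)."""
        pre = matvec(self.w_rec, state)
        new_state = [math.tanh(p + e) for p, e in zip(pre, self.embed[token])]
        return new_state, log_softmax(matvec(self.w_out, new_state))


def beam_search_draft(head, init_state, last_token, num_steps, beam_width):
    """Return up to beam_width drafted continuations as (tokens, score), best first."""
    # Each beam holds: cumulative log-probability, drafted tokens, state, last token.
    beams = [(0.0, [], init_state, last_token)]
    for _ in range(num_steps):
        expanded = []
        for score, tokens, state, prev in beams:
            new_state, log_probs = head.step(state, prev)
            for tok, lp in enumerate(log_probs):
                expanded.append((score + lp, tokens + [tok], new_state, tok))
        # Keep only the highest-scoring candidates; low-scoring ones are dropped early.
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return [(tokens, score) for score, tokens, _, _ in beams]


if __name__ == "__main__":
    head = ToyRecurrentDraftHead(vocab_size=8, hidden_size=4)
    drafts = beam_search_draft(head, [0.0] * 4, last_token=3, num_steps=3, beam_width=2)
    for tokens, score in drafts:
        print(tokens, round(score, 3))
```

Because each step conditions on the previously drafted token through the state, every beam carries its own state, and pruning to `beam_width` after each step is what discards unlikely candidates before they would reach the target model.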