Speculative decoding has emerged as a promising technique for accelerating Large Language Model (LLM) inference: a small language model drafts a hypothesis sequence, which the LLM then validates. Its effectiveness depends heavily on how well the draft model balances performance and efficiency. In this work, we aim to increase the proportion of draft tokens accepted into the final output by generating multiple hypotheses instead of just one. This gives the LLM more candidates to choose from, so it can select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting that redundant computation can be eliminated. Leveraging this observation, we introduce an approach that uses a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure lets us efficiently predict and merge recurring token sequences, substantially reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion-parameter LLaMA-2 model, and observe speedups of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
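To make the idea of merging recurring token sequences concrete, the following is a minimal sketch, not the GSD implementation itself. It assumes a simple merging rule chosen for illustration: two draft tokens become the same graph node when they occupy the same position and share the same last `context` tokens. The names `build_draft_graph`, `count_nodes`, and `longest_accepted` are hypothetical, and the `target` list is a stand-in for the tokens the LLM would confirm during verification.

```python
from collections import defaultdict


def build_draft_graph(hypotheses, context=2):
    """Merge draft hypotheses into a DAG.

    Assumption of this sketch: a node is identified by its position and
    the last `context` tokens ending at that position. Edges always go
    from position i to position i + 1, so the graph cannot contain cycles.
    """
    root = (0, ())
    edges = defaultdict(set)
    for hyp in hypotheses:
        prev = root
        for i in range(len(hyp)):
            window = tuple(hyp[max(0, i + 1 - context): i + 1])
            node = (i + 1, window)
            edges[prev].add(node)
            prev = node
    return root, edges


def count_nodes(edges):
    """Number of distinct draft-token nodes (the root is excluded)."""
    return len({child for children in edges.values() for child in children})


def longest_accepted(root, edges, target):
    """Walk the graph along `target` and return the accepted prefix.

    `target` stands in for the LLM's verification: a draft token is
    accepted only if it equals the token the LLM would emit there.
    """
    node, accepted = root, []
    for tok in target:
        match = next((c for c in edges.get(node, ()) if c[1][-1] == tok), None)
        if match is None:
            break
        accepted.append(tok)
        node = match
    return accepted


hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "on", "a", "mat"],
    ["a", "cat", "sat", "on", "the", "mat"],
]
root, edges = build_draft_graph(hypotheses, context=2)
print(sum(len(h) for h in hypotheses))  # 18 draft tokens before merging
print(count_nodes(edges))               # 10 nodes after merging
print(longest_accepted(root, edges, ["a", "cat", "sat", "on", "a", "dog"]))
```

In this toy example the three hypotheses contain 18 tokens in total, but the merged graph holds only 10 nodes. Under this particular merging rule the graph can also contain paths that no single hypothesis spelled out, which is why five tokens are accepted in the last call even though no individual hypothesis matches the target beyond four.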