Large language models have shown impressive capabilities across a wide range of NLP tasks, yet their autoregressive text generation is slow. One way to speed them up is speculative decoding, which generates candidate segments (sequences of tokens) with a fast draft model and then verifies them in parallel with the target model. However, the acceptance rate of candidate tokens is limited by several factors, such as the model, the dataset, and the decoding configuration. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification that preserve the output distribution of the target model. Our approach yields significant improvements in acceptance rates on multiple datasets and models, consistently outperforming standard speculative decoding.
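The core of any such scheme is a verification step that accepts or rejects draft tokens while leaving the target distribution intact. As a minimal sketch of the general idea, the function below implements a with-replacement multi-candidate acceptance test over toy next-token distributions: each i.i.d. draft candidate is tried with the standard speculative acceptance probability, and on rejection the target distribution is replaced by its normalised residual. The function name and toy distributions are hypothetical, and this is not necessarily the paper's exact algorithm.

```python
import numpy as np

def multi_candidate_verify(p, q, candidates, rng):
    """Pick one token given k i.i.d. draft candidates.

    p: target model's next-token distribution (1-D array, sums to 1).
    q: draft model's distribution the candidates were sampled from.
    candidates: token ids drawn i.i.d. from q.
    Returns a token whose marginal distribution is exactly p.
    """
    p = p.copy()
    for x in candidates:
        # Standard speculative acceptance test for this candidate.
        if rng.random() < min(1.0, p[x] / q[x]):
            return int(x)
        # Rejected: move to the normalised residual target distribution.
        p = np.maximum(p - q, 0.0)
        total = p.sum()
        if total <= 0.0:
            # Only reachable when p == q, where acceptance is certain;
            # guard anyway for numerical safety.
            return int(x)
        p /= total
    # All candidates rejected: sample from the final residual.
    return int(rng.choice(len(p), p=p))
```

For a single candidate this reduces to standard speculative sampling; with several candidates, each extra draw gets another chance to be accepted against the residual, which is why acceptance rates improve while the output distribution stays unchanged.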