Draft-then-verify decoding methods such as speculative decoding are widely adopted, training-free approaches to accelerating the inference of large language models (LLMs). Instead of decoding tokens sequentially in an autoregressive process, speculative decoding first generates drafts with an efficient small model; the LLM then verifies and corrects these drafts in a non-autoregressive fashion to minimize time overhead. Longer drafts yield even greater speedups when verified successfully, but also incur substantial trial-and-error costs when verification fails. Because the probability of verification failure is high, existing decoding methods cannot draft much content for verification at one time, achieving sub-optimal inference acceleration. In this paper, we introduce Ouroboros, which constructs a phrase candidate pool from the LLM's verification process to supply candidates for the small model's draft generation, thereby further improving both the efficiency and the quality of the initial drafts. Experimental results on typical text generation tasks show that Ouroboros achieves speedups of up to 1.9x over lookahead decoding and 2.8x over speculative decoding. The source code of Ouroboros is available at https://github.com/thunlp/Ouroboros.
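The draft-then-verify loop with a phrase candidate pool can be illustrated with a minimal toy sketch. This is not the actual Ouroboros implementation: the "small model" and "LLM" below are deterministic stand-in functions over a fixed character sequence, and the pool keying scheme (last token maps to a harvested phrase) is an illustrative assumption. It only shows the shape of the algorithm: the small model drafts, the LLM accepts the longest correct prefix plus one correction, and verified phrases are fed back into a pool that makes later drafts longer and more accurate.

```python
# Toy draft-then-verify decoding with a phrase candidate pool.
# All names and the pool scheme are illustrative, not the paper's code.

TARGET = "abc abc abc abc"  # stand-in for the LLM's greedy output

def llm_next(prefix: str):
    """Stand-in 'LLM': returns the next target token (character), or None at EOS."""
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def draft(prefix: str, pool: dict, gamma: int = 4) -> str:
    """Stand-in 'small model': reuse a pooled phrase keyed by the last
    token if available, else fall back to a deliberately weak guess."""
    key = prefix[-1] if prefix else ""
    if key in pool:
        return pool[key]
    return "x" * gamma  # poor fallback draft: forces single-token progress

def verify(prefix: str, d: str) -> str:
    """LLM verification: accept draft tokens left to right; replace the
    first mismatch with the LLM's own token (correction) and stop."""
    out = ""
    for t in d:
        expected = llm_next(prefix + out)
        if expected is None:
            break
        if t == expected:
            out += t
        else:
            out += expected  # correction token
            break
    if out == d:  # entire draft accepted: LLM emits one bonus token
        expected = llm_next(prefix + out)
        if expected is not None:
            out += expected
    return out

def decode():
    """Run the loop; return the generated text and the number of
    verification rounds (fewer rounds than tokens = acceleration)."""
    prefix, pool, rounds = "", {}, 0
    while len(prefix) < len(TARGET):
        d = draft(prefix, pool)
        out = verify(prefix, d)
        key = prefix[-1] if prefix else ""
        pool[key] = out  # harvest the verified phrase into the pool
        prefix += out
        rounds += 1
    return prefix, rounds
```

On this repetitive toy target, `decode()` reproduces the sequence exactly while needing fewer verification rounds than there are tokens, because pooled phrases from earlier verifications let the small model draft whole multi-token chunks that the LLM accepts at once.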