Speculative decoding has emerged as a pivotal technique for accelerating the inference of Large Language Models (LLMs). Despite recent research aimed at improving prediction efficiency, multi-sample speculative decoding has been largely overlooked, because different samples in a batch accept different numbers of tokens during the verification phase. The vanilla method adds padding tokens to keep the number of new tokens consistent across samples, but this increases computational and memory-access overhead and thereby reduces the speedup ratio. We propose a novel method that resolves the inconsistency in the number of tokens accepted by different samples without increasing memory or computation overhead. Furthermore, our method also handles the case where different samples predict inconsistent numbers of tokens, again without adding padding tokens. Extensive experiments demonstrate the efficacy of our method. Our code is available at https://github.com/niyunsheng/EMS-SD.
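To make the contrast concrete, the following is a minimal Python sketch (with hypothetical function names, not the released EMS-SD code) of the bookkeeping involved: the vanilla strategy pads every sample in the batch to the longest accepted length, whereas a padding-free strategy packs only the genuinely accepted tokens and tracks each sample's KV-cache write position individually.

```python
# Hypothetical sketch: vanilla padding vs. padding-free packing when samples
# in a batch accept different numbers of draft tokens during verification.

def pad_batch(accepted_per_sample, pad_token=0):
    """Vanilla approach: pad every sample to the longest accepted length.

    Returns the padded token matrix and the number of wasted pad slots,
    which translate into extra compute and memory access.
    """
    max_len = max(len(toks) for toks in accepted_per_sample)
    padded, wasted = [], 0
    for toks in accepted_per_sample:
        n_pad = max_len - len(toks)
        padded.append(toks + [pad_token] * n_pad)
        wasted += n_pad
    return padded, wasted


def pack_without_padding(accepted_per_sample, cache_lens):
    """Padding-free approach: concatenate only the accepted tokens and record,
    per sample, where each token should be written in the KV cache.

    `cache_lens[i]` is the current KV-cache length of sample i; the returned
    (sample_id, position, token) triples let the model consume new tokens
    without any pad slots.
    """
    packed = []
    for i, toks in enumerate(accepted_per_sample):
        for j, tok in enumerate(toks):
            packed.append((i, cache_lens[i] + j, tok))
        cache_lens[i] += len(toks)
    return packed, cache_lens


if __name__ == "__main__":
    # Sample 0 accepted 3 tokens, sample 1 accepted 1, sample 2 accepted 2.
    accepted = [[11, 12, 13], [21], [31, 32]]

    padded, wasted = pad_batch(accepted)
    print("padded batch:", padded, "| wasted pad slots:", wasted)

    packed, new_lens = pack_without_padding(accepted, cache_lens=[8, 5, 7])
    print("packed tokens:", packed, "| updated cache lengths:", new_lens)
```

In this toy example the padded batch spends two of nine slots on padding, while the packed form processes exactly the six accepted tokens; the per-sample cache lengths are what allow padding to be dropped.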