Large Language Models (LLMs) have become essential to advances in natural language processing (NLP), but their sequential token generation limits inference speed. Multi-Draft Speculative Decoding (MDSD) offers a promising remedy: a smaller draft model generates multiple candidate token sequences, which the target LLM verifies in parallel. However, current heuristic approaches such as Recursive Rejection Sampling (RRS) suffer from low acceptance rates on subsequent drafts, limiting the benefit of using multiple drafts. Meanwhile, Optimal Transport with Membership Cost (OTM) can theoretically improve acceptance rates, but its computational cost is prohibitive for real-time use. We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead. By simplifying the OTM problem into a compact Linear Programming model, SpecHub significantly reduces computational complexity. It further accelerates sampling by exploiting a sparse joint distribution, concentrating computation on high-probability token sequences. In extensive experiments, SpecHub consistently generates 0.05-0.27 and 0.02-0.16 more tokens per step than RRS and RRS without replacement, respectively. Our code is available at \url{https://github.com/MasterGodzilla/Speculative_decoding_OT}.
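To make the baseline concrete, the Recursive Rejection Sampling heuristic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each draft token (sampled from the draft distribution `q`) is tested against the current target residual, and on rejection the residual is re-normalized before the next draft is tried. All names here are illustrative assumptions.

```python
import numpy as np

def recursive_rejection_sampling(p, q, draft_tokens, rng=None):
    """Verify draft tokens against target distribution p via Recursive
    Rejection Sampling (RRS) -- a sketch of the heuristic baseline.
    p, q: probability vectors over the vocabulary (target / draft model);
    draft_tokens: token ids sampled i.i.d. from q.
    Returns an accepted token id, or a sample from the final residual.
    """
    if rng is None:
        rng = np.random.default_rng()
    residual = p.astype(float).copy()
    for tok in draft_tokens:
        # Accept this draft with probability min(1, residual[tok] / q[tok]).
        if q[tok] > 0 and rng.random() < min(1.0, residual[tok] / q[tok]):
            return int(tok)
        # On rejection, the target becomes the normalized positive residual.
        residual = np.maximum(residual - q, 0.0)
        total = residual.sum()
        if total == 0.0:  # degenerate case: q already covers the residual
            break
        residual /= total
    # All drafts rejected: fall back to sampling from the residual (or p).
    fallback = residual if residual.sum() > 0 else p
    return int(rng.choice(len(p), p=fallback))
```

Each round is a standard speculative-sampling acceptance test, so the output token is distributed exactly according to `p`; the abstract's point is that the acceptance probability for the second and later drafts drops sharply, which SpecHub's LP-based verification improves on.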