Speculative decoding (SD) has recently emerged as a promising technique for accelerating LLM inference: a small draft model proposes draft tokens in advance, and the large target model validates them in parallel. However, existing SD methods remain fundamentally constrained by their serialized execution, which causes mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose a novel framework, \textbf{SpecBranch}, to unlock branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD and find that the key challenge lies in the trade-off between parallelization and token rollback. Based on this analysis, we strategically introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to enhance parallelism, we jointly orchestrate adaptive draft lengths using a hybrid of implicit draft model confidence and explicit reuse of target model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves \textbf{1.8}$\times \sim$ \textbf{4.5}$\times$ speedups over auto-regressive decoding and reduces rollback tokens by $\textbf{50}$\% for poorly aligned models, demonstrating its applicability to real-world deployments.
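To make the serialized draft-then-verify loop and the notion of rollback tokens concrete, the following is a minimal sketch of vanilla greedy speculative decoding, i.e. the baseline the abstract describes, not SpecBranch itself. The names `toy_target`, `toy_draft`, `speculative_decode`, and the draft length `gamma` are hypothetical and chosen only for illustration; the toy "models" are deterministic functions standing in for real LLMs, and the verification loop stands in for what would be a single parallel forward pass of the target model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Toy 'large' model (assumption): next token depends on the last token."""
    return (3 * prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Toy 'small' model (assumption): agrees with the target except at
    every fourth position, which imitates imperfect draft/target alignment."""
    guess = toy_target(prefix)
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def autoregressive_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, gamma: int = 4
) -> Tuple[List[int], int]:
    """Serialized draft-then-verify loop with greedy acceptance.

    Returns the generated sequence and the number of rolled-back
    (drafted but rejected) tokens.
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    rolled_back = 0
    while len(seq) < goal:
        # Draft phase: the small model proposes gamma tokens one by one.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(seq + proposal))

        # Verify phase: accept the longest prefix that matches the target's
        # greedy choice. A real target model would score all gamma positions
        # in one parallel forward pass; this loop only emulates that check.
        accepted = 0
        for i in range(gamma):
            if target(seq + proposal[:i]) != proposal[i]:
                break
            accepted += 1

        rolled_back += gamma - accepted  # drafted tokens that are discarded
        seq.extend(proposal[:accepted])
        # The target supplies the next token: a correction after a rejection,
        # or a bonus token when every drafted token was accepted.
        seq.append(target(seq))
    return seq[:goal], rolled_back


if __name__ == "__main__":
    tokens, wasted = speculative_decode(toy_target, toy_draft, [1, 2, 3], n_new=20)
    print(tokens, wasted)
```

In this sketch the draft and verify phases strictly alternate, so each model idles while the other works, and every rejection discards the remaining drafted tokens. These are the waiting bubbles and rollback costs that the abstract identifies as the motivation for branch parallelism and adaptive draft lengths.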