Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.