Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stalls caused by mid-sequence token rejections that stem from early errors. To address these limitations, we introduce \textsc{Double} (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval \emph{Precision-Efficiency Dilemma} through a novel synchronous mechanism. Specifically, the draft model executes iterative retrieval speculations to break the theoretical speedup limit, while the target model performs authoritative retrieval to generate multi-token guidance, which mitigates rejections without rollback. \textsc{Double} is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedups of $\textbf{5.3}\times$ on LLaMA3.3-70B and $\textbf{2.8}\times$ on Qwen3-32B, significantly outperforming EAGLE-3, an advanced method that requires extensive model training.
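
To make the first challenge concrete, the following minimal sketch models an idealized PSD round. It is an illustration under assumptions of our own, not the timing model used in this work: one target forward pass costs one time unit, one draft forward pass costs `1 / speed_ratio` units, drafting and verification overlap perfectly, and every drafted token is accepted. The function name `ideal_psd_speedup` and its parameters are hypothetical.

```python
def ideal_psd_speedup(gamma: int, speed_ratio: float) -> float:
    """Speedup of one idealized PSD round over plain autoregressive decoding.

    Assumptions (illustrative only, not taken from the paper):
      * one target forward pass costs 1.0 time unit;
      * one draft forward pass costs 1.0 / speed_ratio time units;
      * drafting gamma tokens overlaps fully with one target verification;
      * all gamma drafted tokens are accepted (best case).
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")

    target_time = 1.0
    draft_time = target_time / speed_ratio

    # Draft and verification run in parallel, so the slower one sets the round time.
    round_time = max(gamma * draft_time, target_time)

    # The autoregressive baseline needs one target forward pass per token.
    baseline_time = gamma * target_time
    return baseline_time / round_time


if __name__ == "__main__":
    speed_ratio = 5.0  # hypothetical: the draft model is 5x faster per token
    for gamma in (1, 2, 4, 8, 16):
        print(gamma, round(ideal_psd_speedup(gamma, speed_ratio), 2))
```

Under these assumptions the speedup grows with the draft length `gamma` only until drafting takes as long as one verification pass, after which it saturates at `speed_ratio`. Rejections, which this sketch ignores, can only lower the figure, so any method whose drafts come solely from autoregressive draft-model passes stays under this ceiling.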
