Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce \textbf{Quasar} (\textbf{Qua}ntized \textbf{S}elf-speculative \textbf{A}cceleration for \textbf{R}apid Inference), a novel, training-free framework designed to overcome this "memory wall" by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fidelity while effectively halving memory traffic. Extensive experiments on state-of-the-art models (e.g., OpenPangu and Qwen3) demonstrate that Quasar maintains a speculative acceptance length comparable to that of full-precision methods while achieving a $1.28\times$ improvement in end-to-end throughput. Being orthogonal to existing drafting strategies, Quasar offers a generic and efficient pathway to accelerate the verification leg of speculative execution. Code is available at https://github.com/Tom-HG/Quasar.
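To make the draft-then-verify loop concrete, the sketch below implements the standard speculative-sampling acceptance rule (accept a drafted token $x$ with probability $\min(1, p_{\text{target}}(x)/p_{\text{draft}}(x))$). This is the generic verification logic, not Quasar's released code; the function and variable names are illustrative. Quasar's change would be that `p_target` comes from a forward pass of a low-bit quantized copy of the target model, leaving this acceptance logic untouched.

```python
import random


def verify_draft(draft_tokens, p_draft, p_target, rng):
    """Generic speculative-sampling verification (illustrative, not Quasar's code).

    Each drafted token x is accepted with probability
    min(1, p_target(x) / p_draft(x)); on the first rejection the
    remaining draft is discarded. In Quasar's setting, p_target would
    be produced by a low-bit quantized verifier, which halves memory
    traffic without altering this acceptance rule.

    draft_tokens: list of token ids proposed by the drafter.
    p_draft, p_target: per-position dicts mapping token id -> probability.
    rng: a random.Random instance for the accept/reject coin flips.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q = p_draft[i].get(tok, 1e-9)   # drafter's probability of its own token
        p = p_target[i].get(tok, 0.0)   # (quantized) target model's probability
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)        # token verified; continue down the draft
        else:
            break                       # first rejection ends the accepted prefix
    return accepted


# Toy usage: target agrees strongly with the draft, so both tokens pass.
p_draft = [{5: 0.5}, {7: 0.4}]
p_target = [{5: 0.9}, {7: 0.8}]
print(verify_draft([5, 7], p_draft, p_target, random.Random(0)))  # [5, 7]
```

The "speculative acceptance length" reported in the abstract is the average length of the `accepted` prefix per verification step; Quasar's claim is that quantizing the verifier leaves this length essentially unchanged while each verification pass becomes cheaper.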