Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but these add latency and computational overhead and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent one, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed by the target model, so that target compute is allocated selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
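The step-level procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how consistency is measured, how the two signals are combined, or what threshold is used. Here `Candidate`, `consistency`, `select_most_consistent`, `confidence`, and `verify_step` are hypothetical names; consistency is approximated by Jaccard token overlap, the ensemble by a weighted average, and `weight` and `threshold` are arbitrary placeholder values. The grounding score is assumed to be precomputed from attention weights and normalized to [0, 1].

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """One draft-model proposal for the next reasoning step (hypothetical type)."""
    text: str
    token_logprobs: List[float]  # per-token log-probabilities from the draft model
    grounding: float             # assumed precomputed attention-based score in [0, 1]


def consistency(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; a stand-in similarity (assumption)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def select_most_consistent(cands: List[Candidate]) -> Candidate:
    """Pick the candidate with the highest mean similarity to the others."""
    def mean_agreement(i: int) -> float:
        sims = [consistency(cands[i].text, c.text)
                for j, c in enumerate(cands) if j != i]
        return sum(sims) / len(sims) if sims else 1.0

    best = max(range(len(cands)), key=mean_agreement)
    return cands[best]


def confidence(c: Candidate) -> float:
    """Geometric-mean token probability, i.e. exp(mean log-prob), in (0, 1]."""
    if not c.token_logprobs:
        return 0.0
    return math.exp(sum(c.token_logprobs) / len(c.token_logprobs))


def verify_step(
    cands: List[Candidate],
    target_generate: Callable[[], str],
    weight: float = 0.5,      # assumed mixing weight between the two signals
    threshold: float = 0.6,   # assumed acceptance threshold
) -> Tuple[str, bool]:
    """Return (step_text, accepted_from_draft).

    The draft step is accepted when the combined score clears the threshold;
    otherwise the step is recomputed by calling the target model.
    """
    best = select_most_consistent(cands)
    score = weight * best.grounding + (1.0 - weight) * confidence(best)
    if score >= threshold:
        return best.text, True
    return target_generate(), False
```

In this sketch the target model is invoked only on the rejection path, which is how the selective allocation of compute shows up: well-grounded, high-confidence draft steps never reach the target.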