Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of individual tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific meanings. Experiments on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeek-R1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.
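The contrast between token-level and semantic-level verification can be made concrete with a toy sketch. The sequence probabilities, the equivalence classes, and the acceptance rule below are illustrative assumptions, not the paper's actual mechanism (which probes hidden states); the point is only that pooling probability mass over semantically equivalent wordings accepts drafts that exact token matching would reject.

```python
# Toy sketch (assumed setup, not SemanticSpec's implementation):
# target-model probabilities over a few candidate continuations.
seq_probs = {
    "so x = 4": 0.35,
    "thus x = 4": 0.30,
    "hence x = 4": 0.15,
    "so x = 5": 0.20,
}

# Hand-labeled semantic equivalence classes: different wordings, same meaning.
meaning_of = {
    "so x = 4": "x=4",
    "thus x = 4": "x=4",
    "hence x = 4": "x=4",
    "so x = 5": "x=5",
}

def token_level_accept(draft):
    # Exact-match verification: accept only the target model's single
    # most likely sequence; any other wording is rejected.
    return draft == max(seq_probs, key=seq_probs.get)

def semantic_accept(draft, threshold=0.5):
    # Semantic verification: pool probability mass over all sequences
    # that share the draft's meaning, then compare to a threshold.
    mass = sum(p for s, p in seq_probs.items()
               if meaning_of[s] == meaning_of[draft])
    return mass >= threshold

draft = "thus x = 4"
print(token_level_accept(draft))  # False: not the single argmax
print(semantic_accept(draft))     # True: class "x=4" holds 0.80 mass
```

Here the draft "thus x = 4" is rejected under exact matching because the target model slightly prefers "so x = 4", yet its meaning class carries 80% of the probability mass, so semantic verification accepts it without a costly re-draft.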