Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of individual tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific meanings. Experiments on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeek-R1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.
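The acceptance rule described above can be sketched in miniature. The following is a hypothetical illustration, not the paper's implementation: `semantic_probability` stands in for the paper's hidden-state probe (here a toy logistic probe with made-up weights), and `verify_semantic_sequence` applies a sequence-level analogue of the standard speculative-sampling acceptance test, accepting a whole drafted semantic unit with probability min(1, p_target / p_draft) rather than verifying token by token.

```python
import math
import random

def semantic_probability(hidden_state, probe_weights, probe_bias):
    # Hypothetical linear probe on the target model's hidden state:
    # estimates the probability that the target model would produce
    # *some* sequence carrying the drafted meaning (logistic head).
    z = sum(h * w for h, w in zip(hidden_state, probe_weights)) + probe_bias
    return 1.0 / (1.0 + math.exp(-z))

def verify_semantic_sequence(draft_prob, target_semantic_prob, rng=random.random):
    # Sequence-level analogue of the speculative-sampling acceptance rule:
    # accept the entire drafted semantic sequence with probability
    # min(1, p_target / p_draft), instead of rejecting on the first
    # token that differs in surface form.
    ratio = min(1.0, target_semantic_prob / draft_prob)
    return rng() < ratio

# Toy numbers (illustrative only, not from the paper):
hidden  = [0.2, -0.5, 1.3, 0.7]    # target model hidden state
weights = [0.4, 0.1, 0.3, -0.2]    # probe parameters
p_target = semantic_probability(hidden, weights, 0.05)
p_draft  = 0.6                     # draft model's prob. of the drafted meaning
accepted = verify_semantic_sequence(p_draft, p_target)
```

On rejection, a real system would fall back to ordinary decoding for that span; the speedup comes from accepting semantically equivalent drafts that token-level verification would discard.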