Test-Time Speculation

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, the number of draft tokens accepted by the target. Our studies show that the acceptance lengths of even state-of-the-art speculators, such as DFlash, EAGLE-3, and PARD, degrade with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, which makes these speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences but must match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as the teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$, and by $41\%$ on average, with the benefits scaling with increased generation lengths.
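
The draft-then-verify loop with online adaptation described above can be illustrated with a small toy sketch. It is not the authors' implementation, and nearly everything in it is an assumption made for illustration: the target is a fixed bigram table, the draft is a table of logits that starts untrained, verification uses greedy token matching, and the update is one cross-entropy gradient step per verified position. The names `VOCAB`, `make_target`, `speculate_round`, and `run` are hypothetical. The sketch counts acceptance length as tokens emitted per round (accepted draft tokens plus the one token the target always contributes), so a value of 1 corresponds to no speedup. It does not model the degradation with generation length; it only shows how the distributions already computed during verification can serve as the distillation signal.

```python
import math

VOCAB = 8  # hypothetical toy vocabulary size


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(xs):
    # Ties resolve to the lowest index.
    return max(range(len(xs)), key=xs.__getitem__)


def make_target():
    """Toy stand-in for the target model: a fixed bigram table whose
    most likely next token after c is (c + 1) % VOCAB."""
    table = []
    for c in range(VOCAB):
        logits = [0.0] * VOCAB
        logits[(c + 1) % VOCAB] = 3.0
        table.append(softmax(logits))
    return table


def speculate_round(target, draft_logits, last, k, lr):
    """Run one draft-then-verify round and adapt the draft in place.

    Returns the tokens emitted this round; their count is the
    acceptance length (accepted draft tokens plus one target token).
    """
    # 1) The draft proposes k tokens greedily from its own predictions.
    proposal, ctx = [], last
    for _ in range(k):
        ctx = argmax(draft_logits[ctx])
        proposal.append(ctx)

    # 2) The target verifies them (greedy matching is an assumption here).
    #    Every target distribution computed along the way is recorded.
    emitted, seen, ctx = [], [], last
    for tok in proposal:
        p = target[ctx]
        seen.append((ctx, p))
        t = argmax(p)
        emitted.append(t)      # the target's token is always emitted
        if t != tok:           # first mismatch ends the round
            break
        ctx = t
    else:
        # All k draft tokens accepted: the target adds one bonus token.
        p = target[ctx]
        seen.append((ctx, p))
        emitted.append(argmax(p))

    # 3) Online distillation (assumed form): one gradient step on the
    #    cross-entropy between target (teacher) and draft (student),
    #    reusing only distributions already produced in step 2.
    for c, p in seen:
        q = softmax(draft_logits[c])
        for i in range(VOCAB):
            draft_logits[c][i] += lr * (p[i] - q[i])
    return emitted


def run(rounds=20, k=4, lr=1.0):
    """Return the acceptance length of each round; lr=0.0 freezes the draft."""
    target = make_target()
    draft_logits = [[0.0] * VOCAB for _ in range(VOCAB)]  # untrained draft
    last, lengths = 0, []
    for _ in range(rounds):
        emitted = speculate_round(target, draft_logits, last, k, lr)
        lengths.append(len(emitted))
        last = emitted[-1]
    return lengths


if __name__ == "__main__":
    print("frozen :", run(lr=0.0))
    print("adapted:", run(lr=1.0))
```

Calling `run(lr=0.0)` keeps the draft frozen, while `run(lr=1.0)` adapts it after every round. In this toy setup the adapted draft reaches the maximum of `k + 1` tokens per round once it has seen each context, whereas the frozen draft stays at 1 or 2. A real system would replace the lookup tables with neural networks and would likely use rejection-sampling verification; those details are outside what the abstract specifies.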
