Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound regimes but degrades performance in high-load, compute-bound regimes due to verification overhead. Existing SD strategies typically rely on static speculative lengths, failing to adapt to fluctuating request loads or to identify the optimal moment to halt speculation; the cost of restarting speculative inference also remains unquantified. During traffic surges, the marginal utility of speculation diminishes, yet the draft model's persistent memory footprint competes with the KV cache for GPU memory. This resource contention caps the maximum batch size, thereby degrading overall system throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative decoding framework. Nightjar first adapts to the request load by dynamically selecting the optimal speculative length for each batch size. Crucially, upon detecting significant request queuing or KV cache shortage, it disables speculative decoding and offloads the draft model to the CPU, reclaiming memory for the KV cache, enabling larger batch sizes, and maximizing overall system throughput. Experiments show that, under dynamic request arrival rates in real-time LLM serving scenarios, Nightjar achieves up to 27.29% higher throughput and 12.90% lower latency than standard speculative decoding.
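The adaptive policy described above can be sketched as a simple controller. This is a minimal illustration, not the paper's implementation: the thresholds (`QUEUE_LIMIT`, `KV_FREE_MIN`), the batch-size-to-length lookup table, and all names are hypothetical placeholders for the quantities the abstract mentions (request queuing, KV cache headroom, per-batch-size speculative length).

```python
from dataclasses import dataclass

@dataclass
class ServingState:
    """Snapshot of serving-engine load (all fields illustrative)."""
    batch_size: int
    queued_requests: int
    kv_cache_free_ratio: float  # fraction of KV cache blocks still free

# Hypothetical thresholds; the paper would tune these empirically.
QUEUE_LIMIT = 8       # max queued requests before speculation is disabled
KV_FREE_MIN = 0.10    # min free KV cache fraction before speculation is disabled

# Hypothetical lookup: smaller (memory-bound) batches tolerate longer speculation.
SPEC_LEN_BY_BATCH = [(4, 5), (16, 3), (64, 2)]

def speculative_length(state: ServingState) -> int:
    """Pick a speculative length for the current load.

    Returns 0 to signal that speculative decoding should be disabled and
    the draft model offloaded to CPU, reclaiming GPU memory for KV cache.
    """
    # High load: marginal utility of speculation diminishes, so stop speculating.
    if state.queued_requests > QUEUE_LIMIT or state.kv_cache_free_ratio < KV_FREE_MIN:
        return 0
    # Low load: pick the longest speculative length the batch size tolerates.
    for max_batch, length in SPEC_LEN_BY_BATCH:
        if state.batch_size <= max_batch:
            return length
    return 1  # very large batches: minimal speculation
```

In this sketch, a return value of 0 would trigger the offload path (draft model to CPU, its GPU memory handed back to the KV cache allocator), while positive values set the number of draft tokens verified per step.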