Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use a fixed speculative length and can neither adapt to workload changes nor decide when to stop speculating; the cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV-cache capacity, limiting batch size and degrading throughput. To address this trade-off, we propose Nightjar, a resource-aware adaptive speculative decoding framework. Nightjar first adapts to the request load by dynamically selecting the optimal speculative length for each batch size. Crucially, it proactively disables speculative decoding when its multi-armed bandit (MAB) planner determines that speculation is no longer beneficial, and during the disabled phase offloads the draft model to the CPU only under GPU memory pressure. This reclaims memory for the KV cache, enabling larger batch sizes and maximizing overall system throughput. Experiments show that Nightjar achieves on average 27.29% higher throughput and up to 20.18% lower latency than standard speculative decoding under dynamic request arrival rates in real-time LLM serving.
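The adaptive planner described above can be pictured as a multi-armed bandit whose arms are candidate speculative lengths, with length 0 meaning "speculation disabled". The sketch below uses a standard UCB1 bandit; the candidate lengths, the per-batch-size bucketing, the throughput reward, and the toy workload model are all illustrative assumptions, not Nightjar's published implementation.

```python
import math
import random

class LengthBandit:
    """UCB1 bandit over candidate speculative lengths (a minimal sketch).

    Arm value 0 stands for "speculation disabled". The reward for an arm is
    the serving throughput observed while using that speculative length.
    """

    def __init__(self, lengths=(0, 2, 4, 8)):  # candidate set is an assumption
        self.lengths = lengths
        self.counts = [0] * len(lengths)
        self.total_reward = [0.0] * len(lengths)
        self.t = 0

    def select(self):
        self.t += 1
        for arm, c in enumerate(self.counts):
            if c == 0:                  # play every arm once before using UCB
                return arm
        def ucb(arm):
            mean = self.total_reward[arm] / self.counts[arm]
            return mean + math.sqrt(2 * math.log(self.t) / self.counts[arm])
        return max(range(len(self.lengths)), key=ucb)

    def update(self, arm, tokens_per_sec):
        self.counts[arm] += 1
        self.total_reward[arm] += tokens_per_sec

# One planner per coarse batch-size bucket, so the chosen length can differ
# between low-load and high-load regimes (the bucketing is an assumption).
planners = {"low": LengthBandit(), "high": LengthBandit()}

def toy_throughput(length, batch_size):
    # Toy workload model: drafting helps memory-bound small batches, while
    # verification overhead hurts compute-bound large batches.
    base = 100.0 * batch_size
    return base + length * (8 - batch_size) * 2.0 + random.uniform(-5, 5)

random.seed(0)
for step in range(400):
    batch_size = 2 if step % 2 == 0 else 16     # alternate low and high load
    bandit = planners["low" if batch_size <= 8 else "high"]
    arm = bandit.select()
    bandit.update(arm, toy_throughput(bandit.lengths[arm], batch_size))

# In this toy model, the low-load planner concentrates on the longest
# speculative length, while the high-load planner concentrates on arm 0,
# i.e. it learns to disable speculation under heavy load.
```

In a real serving loop the reward would be the measured decode throughput per scheduling step, and selecting arm 0 would additionally trigger the memory-reclamation path (offloading the draft model only under GPU memory pressure).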