Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.
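The core idea of load-adaptive speculation can be illustrated with a toy cost model. The sketch below is an illustrative assumption, not Nightjar's actual (learning-based) implementation: it models per-step latency as a memory-bound base cost plus a compute term growing with batch size, and picks the speculative length (0 = speculation disabled) that maximizes expected tokens per millisecond. All function names and constants are hypothetical.

```python
def expected_tokens_per_step(spec_len: int, accept_rate: float) -> float:
    """Expected tokens committed per verification step with draft length
    spec_len, assuming i.i.d. per-token acceptance probability accept_rate
    (a standard simplification): 1 + p + p^2 + ... + p^spec_len."""
    if spec_len == 0:
        return 1.0  # plain autoregressive decoding: one token per step
    return sum(accept_rate ** k for k in range(spec_len + 1))

def step_latency_ms(batch_size: int, spec_len: int) -> float:
    """Toy latency model (assumed constants): a fixed memory-bound cost
    plus a compute cost proportional to the batch_size * (spec_len + 1)
    tokens verified in parallel."""
    return 10.0 + 0.02 * batch_size * (spec_len + 1)

def best_spec_len(batch_size: int, accept_rate: float = 0.7,
                  max_len: int = 8) -> int:
    """Pick the speculative length maximizing throughput (tokens/ms);
    returns 0 when disabling speculation is fastest, i.e. at high load."""
    return max(range(max_len + 1),
               key=lambda L: expected_tokens_per_step(L, accept_rate)
                             / step_latency_ms(batch_size, L))
```

Under this model, a lightly loaded (memory-bound) server favors long drafts, while a heavily loaded (compute-bound) one drives the optimal length toward zero, which mirrors the trade-off the abstract describes.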