Large Language Model (LLM) services often struggle to achieve low inference latency and meet Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which uses a lightweight model to draft tokens and the LLM to verify them, has emerged as a compelling technique for accelerating LLM inference. However, existing speculative decoding solutions often fail to adapt to varying workloads and system environments, resulting in performance variability and SLO violations. In this paper, we introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. SpecServe proposes a theoretical model to understand and predict the efficiency of speculative decoding across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to guarantee optimal performance while achieving high SLO attainment. Experimental results on real-world LLM traces demonstrate that SpecServe consistently meets SLOs and achieves substantial performance improvements, yielding 1.14$\times$ to 14.3$\times$ speedups over state-of-the-art speculative inference systems.
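To make the draft-then-verify mechanism behind speculative decoding concrete, the sketch below shows a minimal greedy variant in Python. The `draft_model` and `target_model` callables and all parameter names are illustrative assumptions, not SpecServe's API; production systems verify all drafted tokens in a single batched forward pass of the target model rather than token by token, and SpecServe additionally adapts the speculation length (`draft_len` here) to the current load.

```python
from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],   # cheap next-token predictor (hypothetical)
    target_model: Callable[[List[int]], int],  # expensive next-token predictor (hypothetical)
    draft_len: int = 4,                        # tokens speculated per step
    max_new_tokens: int = 32,
) -> List[int]:
    """Greedy draft-then-verify loop: the draft model proposes a short
    continuation, and the target model accepts the longest agreeing prefix,
    contributing one corrective token at the first mismatch."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1) Draft: the small model proposes draft_len tokens autoregressively.
        ctx = list(out)
        drafted = []
        for _ in range(draft_len):
            tok = draft_model(ctx)
            drafted.append(tok)
            ctx.append(tok)
        # 2) Verify: walk the drafted tokens in order; accept while the target
        #    model agrees, and on the first disagreement emit the target's own
        #    token instead, so every step makes at least one token of progress.
        ctx = list(out)
        for tok in drafted:
            expected = target_model(ctx)
            if expected != tok:
                out.append(expected)
                break
            out.append(tok)
            ctx.append(tok)
    return out[: len(prefix) + max_new_tokens]

# Toy usage: both "models" deterministically predict (last token + 1) mod 100,
# so every draft is accepted and each verification step advances draft_len tokens.
if __name__ == "__main__":
    step = lambda ctx: (ctx[-1] + 1) % 100
    print(speculative_decode([0], step, step, draft_len=4, max_new_tokens=16))
```

The speedup of this scheme hinges on the acceptance rate: when the draft model agrees with the target, one expensive verification step yields several tokens, but when drafts are frequently rejected the verification work is wasted, which is why a fixed `draft_len` underperforms under shifting workloads.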