An Interpretable Latency Model for Speculative Decoding in LLM Serving

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier and drafter model sizes shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture-of-experts (MoE) models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
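To make the role of Little's Law concrete, the following minimal Python sketch infers an effective batch size from a request rate. It assumes, purely for illustration, that per-request latency is linear in the batch size, W(B) = fixed_cost + per_batch_cost * B; this functional form, the function names, and the parameter names are hypothetical and are not taken from the paper's fitted model. Little's Law (B = rate * W) then yields a closed-form fixed point.

```python
def effective_batch_size(rate, fixed_cost, per_batch_cost):
    """Solve B = rate * (fixed_cost + per_batch_cost * B) for B.

    rate           -- request arrival rate (requests per second)
    fixed_cost     -- load-independent latency per request (seconds), assumed
    per_batch_cost -- extra latency per concurrent request (seconds), assumed
    """
    utilization = rate * per_batch_cost
    if utilization >= 1.0:
        # No finite fixed point exists: the assumed linear model saturates.
        raise ValueError("request rate exceeds capacity of the assumed model")
    return rate * fixed_cost / (1.0 - utilization)


def request_latency(rate, fixed_cost, per_batch_cost):
    """Per-request latency at the batch size implied by Little's Law."""
    batch = effective_batch_size(rate, fixed_cost, per_batch_cost)
    return fixed_cost + per_batch_cost * batch


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    for rate in (1.0, 2.0, 4.0):
        b = effective_batch_size(rate, fixed_cost=0.25, per_batch_cost=0.125)
        w = request_latency(rate, fixed_cost=0.25, per_batch_cost=0.125)
        print(f"rate={rate:.1f} req/s  batch={b:.3f}  latency={w:.3f} s")
```

Under this assumed form, latency grows faster than linearly in the request rate as rate * per_batch_cost approaches 1, which is one simple way a load-dependent cost term can erode speedups at high load.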
