SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose \textbf{SPECTRE} (Parallel \textbf{SPEC}ulative Decoding with a Multi-\textbf{T}enant \textbf{RE}mote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft--target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in \texttt{SGLang} and evaluate it across multiple draft--target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to \textbf{2.28$\times$ speedup} over autoregressive decoding and up to an additional \textbf{66\% relative improvement} over the strongest speculative decoding baselines. Code is available at https://github.com/sgl-project/sglang/pull/22272.
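
To give some intuition for why a throughput-derived threshold can decide between ordinary and parallel speculation, the sketch below compares the two modes under a deliberately simplified cost model. Everything in it is an assumption made for illustration and is not SPECTRE's actual analysis: the i.i.d. per-token acceptance rate `alpha`, the draft length `gamma`, the latencies `t_draft` (per draft token) and `t_verify` (per verification pass), and the rule that parallel mode gives up the bonus target token on full acceptance and must redraft without overlap after a rejection. All function names are hypothetical.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens per ordinary round: accepted draft prefix plus one target token.

    Standard closed form for i.i.d. acceptance: (1 - alpha^(gamma+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def ordinary_throughput(alpha: float, gamma: int, t_draft: float, t_verify: float) -> float:
    """Ordinary mode: draft gamma tokens, then verify, strictly in sequence."""
    return expected_tokens(alpha, gamma) / (gamma * t_draft + t_verify)


def parallel_throughput(alpha: float, gamma: int, t_draft: float, t_verify: float) -> float:
    """Parallel mode under the assumed toy model.

    Drafting of the next round overlaps with verification of the current one,
    so a round costs max(draft, verify). Assumptions: on full acceptance the
    round yields gamma tokens (no bonus token); on a rejection the overlapped
    draft is discarded and a fresh, non-overlapped draft is paid for.
    """
    p_all = alpha ** gamma                       # probability the whole draft is accepted
    tokens = expected_tokens(alpha, gamma) - p_all
    time = max(gamma * t_draft, t_verify) + (1.0 - p_all) * gamma * t_draft
    return tokens / time


def choose_mode(alpha: float, gamma: int, t_draft: float, t_verify: float) -> str:
    """Pick the mode with the higher estimated throughput."""
    par = parallel_throughput(alpha, gamma, t_draft, t_verify)
    ordi = ordinary_throughput(alpha, gamma, t_draft, t_verify)
    return "parallel" if par > ordi else "ordinary"


if __name__ == "__main__":
    # Fast drafter relative to the verifier vs. a slower (e.g. remote) drafter.
    print(choose_mode(alpha=0.8, gamma=4, t_draft=1.0, t_verify=10.0))
    print(choose_mode(alpha=0.8, gamma=4, t_draft=2.5, t_verify=10.0))
```

In this toy model, overlap pays off once drafting time becomes comparable to verification time, which is the regime a remote drafter is more likely to be in; when drafting is cheap, ordinary speculation keeps its bonus token and comes out ahead. The crossover point plays the role of the threshold, though the real threshold would depend on measured latencies and acceptance rates.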
