Speculative decoding promises to accelerate inference for large language models (LLMs), yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a framework oriented toward high concurrency and integrated into SGLang, which reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales, particularly the industrial-grade Qwen3-235B, demonstrate that ECHO consistently outperforms state-of-the-art methods in both low-load and high-load scenarios, achieving up to a 5.35x wall-time speedup and a relative speedup gain of over 20%.
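To make the idea of budgeted scheduling over a batch-wide super-tree concrete, the following is a minimal sketch, not ECHO's actual algorithm. It assumes each draft node carries the draft model's probability, uses the product of probabilities along a path as the confidence score, and applies a fixed gate threshold. The names `Node`, `select_super_tree`, `budget`, and `gate` are all hypothetical and do not come from ECHO or SGLang.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    """One drafted token; `prob` is the draft model's probability (assumed)."""
    prob: float
    children: list["Node"] = field(default_factory=list)


def select_super_tree(roots, budget, gate=0.05):
    """Pick at most `budget` draft nodes across the whole batch.

    roots: dict mapping request id -> list of first-level draft nodes.
    Returns a dict mapping request id -> list of (depth, path_confidence)
    for the nodes chosen for verification.
    """
    heap = []    # max-heap emulated by pushing negated path confidence
    counter = 0  # unique tie-breaker so Node objects are never compared
    for req_id, nodes in roots.items():
        for node in nodes:
            heapq.heappush(heap, (-node.prob, counter, req_id, 1, node))
            counter += 1

    chosen = {req_id: [] for req_id in roots}
    used = 0
    while heap and used < budget:
        neg_conf, _, req_id, depth, node = heapq.heappop(heap)
        conf = -neg_conf
        if conf < gate:
            # Path confidence never increases with depth, so every node
            # still reachable is below the gate as well.
            break
        chosen[req_id].append((depth, conf))
        used += 1
        # A child only becomes a candidate once its parent is selected,
        # which keeps each request's selection a connected tree.
        for child in node.children:
            item = (-(conf * child.prob), counter, req_id, depth + 1, child)
            heapq.heappush(heap, item)
            counter += 1
    return chosen
```

Because all requests compete for one shared budget, a request with a confident draft chain is extended in depth, while an uncertain request receives only a few wide candidates or none. For example, calling `select_super_tree` with a budget of 4 on one request holding a chain of three nodes at probability 0.9 and another holding siblings at 0.4, 0.3, and 0.02 selects the full chain plus the 0.4 sibling.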