Speculative decoding can significantly accelerate LLM serving, yet most deployments today decouple speculator training from serving, treating it as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be reliably inferred from the acceptance rate alone, owing to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator grows stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).