When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
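To illustrate why acceptance rate alone does not determine end-to-end speedup, the sketch below uses the widely cited simplified cost model for speculative decoding. It is not Aurora's own estimator. The assumptions are ours: each drafted token is accepted independently with probability `alpha`, the speculator drafts `gamma` tokens per step, and `cost_ratio` is a hypothetical parameter that folds all drafting and system overhead into one number, expressed relative to a single target-model forward pass.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha (a simplification; real traces are correlated).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target model.
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    cost_ratio is an assumed, illustrative parameter: the cost of one
    draft step relative to one target-model step, overheads included.
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    step_cost = gamma * cost_ratio + 1.0  # gamma draft steps + one verification
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Same acceptance rate, different drafting overheads.
    for c in (0.05, 0.30):
        print(f"cost_ratio={c:.2f} speedup={estimated_speedup(0.8, 4, c):.2f}")
```

Under this model, two speculators with the same acceptance rate can produce noticeably different speedups once their drafting costs differ, which is consistent with the abstract's point that true utility is only known by measuring it in the serving system.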
