RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example through off-policy execution, replay, or lower-precision generation. We instead study speculative decoding as a lossless acceleration primitive for RL rollouts, one that preserves the target model's output distribution exactly. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. The benefit holds across speculation mechanisms, including pretrained MTP heads, small external draft models, and even techniques such as Eagle3 that are traditionally applied only after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to a 2.5x end-to-end training speedup at 235B scale.
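The lossless property follows from the standard speculative-sampling acceptance rule: each draft token x proposed by the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, and rejections are resampled from the normalized residual of p minus q, so the emitted sequence is distributed exactly as p. As a concrete illustration of the rollout-side setup, the sketch below shows how draft-model speculation can be enabled in a vLLM engine. It is a minimal sketch, not the paper's NeMo-RL integration: the model names are hypothetical, and the `speculative_config` fields follow recent vLLM releases and may vary by version.

```python
# Minimal sketch (assumption, not the paper's code): draft-model speculative
# decoding for rollout generation with a vLLM engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",  # hypothetical 8B target (policy) model
    speculative_config={
        "model": "Qwen/Qwen3-0.6B",   # hypothetical small external draft model
        "num_speculative_tokens": 5,  # draft tokens proposed per speculation step
    },
)

# Rollouts sample from the target model's distribution; because speculative
# decoding is lossless, these outputs match plain autoregressive decoding.
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)
outputs = llm.generate(["Solve: what is the 10th prime number?"], params)
print(outputs[0].outputs[0].text)
```

In an RL loop, the same engine configuration would serve both synchronous and asynchronous rollout pipelines; only the scheduling of generation relative to policy updates changes.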