Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
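The mismatch described above can be illustrated with a toy simulation. This is a minimal sketch, not the Jet-RL implementation. It assumes a single linear output layer, per-tensor scaling, and a simulated E4M3-style rounding function (`fake_quant_fp8`, a hypothetical helper) in place of real FP8 kernels, with float64 standing in for BF16. Tokens are sampled from the rollout policy, and the per-token log importance ratio between the training policy and the rollout policy is measured for a mixed-precision setup and for a unified-precision setup.

```python
import numpy as np

def fake_quant_fp8(x, max_val=448.0):
    """Simulate E4M3-style rounding (3 mantissa bits) with per-tensor scaling.

    Hypothetical helper for illustration only; real FP8 pipelines typically use
    hardware casts and finer-grained (per-block or per-channel) scales.
    """
    x = np.asarray(x, dtype=np.float64)
    amax = np.max(np.abs(x))
    scale = max_val / amax if amax > 0 else 1.0
    y = x * scale
    _, exp = np.frexp(y)                # y = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)           # below 2**-6, use the subnormal spacing
    step = np.ldexp(1.0, exp - 4)       # grid spacing for 3 mantissa bits
    q = np.clip(np.round(y / step) * step, -max_val, max_val)
    return q / scale

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def token_log_ratios(w_train, w_rollout, hidden, rng):
    """Sample tokens from the rollout policy; return log pi_train - log pi_rollout."""
    logp_rollout = log_softmax(hidden @ w_rollout.T)
    logp_train = log_softmax(hidden @ w_train.T)
    # Gumbel-max trick: argmax(logp + Gumbel noise) is a sample from softmax(logp).
    tokens = np.argmax(logp_rollout + rng.gumbel(size=logp_rollout.shape), axis=-1)
    idx = np.arange(len(tokens))
    return logp_train[idx, tokens] - logp_rollout[idx, tokens]

rng = np.random.default_rng(0)
d, vocab, steps = 64, 1000, 512         # arbitrary toy sizes
w = rng.normal(scale=d ** -0.5, size=(vocab, d))
hidden = rng.normal(size=(steps, d))
w_fp8 = fake_quant_fp8(w)

# Mixed: high-precision training weights, quantized rollout weights.
mixed = token_log_ratios(w, w_fp8, hidden, np.random.default_rng(1))
# Unified: the same quantized weights on both sides.
unified = token_log_ratios(w_fp8, w_fp8, hidden, np.random.default_rng(1))

print("mixed   mean |log ratio|:", np.abs(mixed).mean())
print("unified mean |log ratio|:", np.abs(unified).mean())
print("mixed   |sum of log ratios| over the sequence:", abs(mixed.sum()))
```

In this toy setting the unified configuration gives a log ratio of zero up to floating-point noise, because the sampled tokens are scored by the same policy that generated them. The mixed configuration gives a nonzero ratio at every token, and these per-token terms add up over the sequence, which is consistent with the observation that the mismatch matters most for long rollouts.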
