Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), owing to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy that addresses this bottleneck by improving rollout efficiency while maintaining training stability. SortedRL reorders rollout samples by output length and groups short samples together so that they can be used for early updates. This simultaneously enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. To further accelerate the pipeline, SortedRL uses a cache-based mechanism to control the degree of off-policy training, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% better performance than the baseline given the same amount of data.
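The length-aware grouping described above can be illustrated with a minimal sketch. It is not the SortedRL implementation: it assumes all rollouts have already finished (an offline simplification of the online scheduler), and the names `Rollout`, `length_sorted_groups`, and `update_size` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Rollout:
    """Hypothetical record for one finished rollout sample."""
    prompt_id: int
    output_len: int  # number of generated tokens


def length_sorted_groups(
    rollouts: List[Rollout], update_size: int
) -> Iterator[List[Rollout]]:
    """Yield update groups of at most `update_size` samples, shortest outputs first."""
    if update_size <= 0:
        raise ValueError("update_size must be positive")
    # Sort by output length so that short samples form the earliest update groups.
    ordered = sorted(rollouts, key=lambda r: r.output_len)
    for start in range(0, len(ordered), update_size):
        yield ordered[start:start + update_size]


# Illustrative lengths only; these are not measurements from the paper.
rollouts = [Rollout(i, n) for i, n in enumerate([900, 120, 16000, 450, 3000, 60])]
groups = [
    [r.output_len for r in group]
    for group in length_sorted_groups(rollouts, update_size=2)
]
print(groups)  # [[60, 120], [450, 900], [3000, 16000]]
```

In this toy run the policy would be updated first on the two shortest samples and last on the longest ones, which is the ordering the abstract refers to as a micro-curriculum. An online scheduler would additionally have to decide when to stop waiting for unfinished long generations, which this sketch does not model.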