RollArt: Disaggregated Multi-Task Agentic RL Training at Scale

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

From arXiv, 19 pages, 15 figures

Agentic Reinforcement Learning (RL) trains LLMs through multi-turn interactions with environments, producing workloads that mix compute-bound prefill, bandwidth-bound decoding, CPU-heavy environment execution, and bursty reward evaluation. Existing systems either colocate all stages on a single GPU cluster or decouple them only at a coarse granularity, overlooking hardware heterogeneity and incurring substantial synchronization overhead across stages. We present ROLLART, a system for multi-task agentic RL on disaggregated infrastructure. ROLLART maps each pipeline stage to its best-fit hardware, routing prefill-heavy tasks to compute-optimized GPUs, decode-heavy tasks to bandwidth-optimized GPUs, and environments to CPU clusters. It decouples rollout at the trajectory level, allowing generation, environment interaction, and reward scoring to proceed independently, so that slow or failed environments never block the others. ROLLART also offloads stateless reward computation to serverless infrastructure and overlaps rollout with training via staleness-bounded asynchronous weight synchronization. Our results show that ROLLART improves training throughput and reduces training time by 1.31-2.05\(\times\) compared to various RL systems. We also evaluated ROLLART by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, demonstrating its stability and scalability.
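The staleness bound mentioned in the abstract can be illustrated with a small sketch. This is not ROLLART's implementation. The names `Trajectory`, `StalenessBoundedBuffer`, and `max_staleness` are hypothetical, and the policy of discarding over-stale trajectories is an assumption made for illustration, since the abstract does not say how stale samples are handled. The idea is that rollout workers keep generating with older weights while training proceeds, and the trainer only consumes trajectories whose generating policy is at most a fixed number of versions behind.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str            # which task produced this trajectory (hypothetical field)
    policy_version: int  # version of the weights that generated it
    reward: float


class StalenessBoundedBuffer:
    """FIFO buffer that only releases trajectories within a staleness bound.

    Assumption for illustration: trajectories older than the bound are dropped.
    """

    def __init__(self, max_staleness: int) -> None:
        if max_staleness < 0:
            raise ValueError("max_staleness must be non-negative")
        self.max_staleness = max_staleness
        self._items: deque[Trajectory] = deque()

    def __len__(self) -> int:
        return len(self._items)

    def put(self, traj: Trajectory) -> None:
        # Rollout workers call this whenever a trajectory finishes,
        # independently of what the trainer is doing.
        self._items.append(traj)

    def take_batch(self, trainer_version: int, batch_size: int) -> list[Trajectory]:
        # Pop in arrival order; keep a trajectory only if its generating
        # policy is at most max_staleness versions behind the trainer.
        batch: list[Trajectory] = []
        while self._items and len(batch) < batch_size:
            traj = self._items.popleft()
            if trainer_version - traj.policy_version <= self.max_staleness:
                batch.append(traj)
        return batch


buf = StalenessBoundedBuffer(max_staleness=1)
buf.put(Trajectory("swe", policy_version=3, reward=1.0))   # 2 versions behind
buf.put(Trajectory("math", policy_version=5, reward=0.0))  # current
buf.put(Trajectory("web", policy_version=4, reward=0.5))   # 1 version behind
batch = buf.take_batch(trainer_version=5, batch_size=8)
print([t.task for t in batch])
```

With a bound of 1 and the trainer at version 5, the trajectory generated at version 3 is discarded and the other two are returned in arrival order. A larger bound admits more overlap between rollout and training at the cost of training on more off-policy data.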
