ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

Wei Gao, Yuheng Zhao, Dilxat Muhtar, Dakai An, Xuchun Shang, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Weixun Wang, Ju Huang, Teng Ma, Siran Yang, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

from arXiv, 19 pages, 15 figures

Agentic reinforcement learning (RL) has emerged as a key driver for improving the multi-step reasoning and tool-use capabilities of LLMs. However, its efficiency is bottlenecked by long-tail rollouts with multi-turn environment interactions, making static GPU provisioning a poor fit: overprovisioning wastes GPUs on stragglers, while underprovisioning increases contention and slows training. We observe that production serving clusters routinely leave substantial GPU compute and memory headroom. Based on this observation, we argue for cooperative elasticity: opportunistically repurposing underutilized serving GPUs to execute rollouts. Realizing cooperative elasticity is non-trivial because it must preserve serving Service Level Objectives (SLOs) under bursty traffic and minimize communication overhead. To address these challenges, we present ROSE, a cooperative, resource-elastic post-training system that safely harvests idle compute and memory on serving GPUs to accelerate agentic RL rollouts. ROSE consists of three components: (1) an SLO-safe co-serving executor that improves rollout throughput while preserving serving SLOs through efficient GPU memory and compute sharing; (2) a cross-cluster weight transfer engine that leverages weight shards and sparsity for fast weight synchronization across clusters; and (3) an elastic rollout scheduler that dynamically provisions cooperative capacity and routes trajectory rollouts across dedicated rollout GPUs and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves average end-to-end throughput by 1.20-3.31× compared with state-of-the-art resource-fixed and elastic baselines.
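
To make the routing idea in component (3) concrete, the following is a minimal illustrative sketch and not ROSE's actual scheduler. It assumes a simple policy that the abstract does not specify: fill dedicated rollout GPUs first, then spill onto serving GPUs only while their serving load stays below a threshold. The names `Worker`, `route_rollouts`, and `max_serving_load`, and the slot-based capacity model, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of one GPU worker; not taken from the paper."""
    name: str
    kind: str                   # "dedicated" or "serving"
    capacity: int               # max concurrent rollouts this worker may host
    active: int = 0             # rollouts currently running on it
    serving_load: float = 0.0   # fraction of the GPU busy with online serving

    def free_slots(self, max_serving_load: float) -> int:
        # Assumed SLO guard: a serving GPU offers no rollout slots while
        # its serving load is above the threshold (e.g. during a burst).
        if self.kind == "serving" and self.serving_load > max_serving_load:
            return 0
        return max(self.capacity - self.active, 0)


def route_rollouts(num_rollouts, workers, max_serving_load=0.5):
    """Greedily place rollouts; return (per-worker counts, unplaced count)."""
    assignment = {}
    remaining = num_rollouts
    # Stable sort puts dedicated workers ahead of serving workers.
    for w in sorted(workers, key=lambda w: w.kind != "dedicated"):
        if remaining == 0:
            break
        take = min(w.free_slots(max_serving_load), remaining)
        if take > 0:
            assignment[w.name] = take
            w.active += take
            remaining -= take
    return assignment, remaining


workers = [
    Worker("s0", "serving", capacity=4, serving_load=0.2),
    Worker("d0", "dedicated", capacity=2),
    Worker("s1", "serving", capacity=4, serving_load=0.9),  # bursty, skipped
]
placed, unplaced = route_rollouts(5, workers)
print(placed, unplaced)
```

In this toy setup the two dedicated slots fill first, the lightly loaded serving GPU absorbs the overflow, and the heavily loaded serving GPU is left alone. A real system would presumably also need to preempt or migrate rollouts when serving traffic spikes, which this sketch does not model.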
