RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

Reinforcement learning (RL) has become essential for unlocking advanced reasoning capabilities in large language models (LLMs). RL workflows interleave rollout and training stages that have fundamentally different resource requirements. Rollout typically dominates overall execution time, yet scales efficiently across multiple independent instances. Training, in contrast, requires tightly coupled GPUs with full-mesh communication. Existing RL frameworks fall into two categories: co-located and disaggregated architectures. Co-located frameworks fail to address this resource tension because they force both stages to share the same GPUs. Disaggregated architectures suffer from resource under-utilization unless well-established RL algorithms are modified. Meanwhile, preemptible GPU resources, i.e., spot instances on public clouds and spare capacity in production clusters, present significant cost-saving opportunities for accelerating RL workflows if they are efficiently harvested for rollout. In this paper, we present RLBoost, a framework for cost-efficient RL training that harvests preemptible GPU resources. Our key insight is that rollout's stateless and embarrassingly parallel nature aligns perfectly with preemptible and often fragmented resources. To utilize these resources efficiently despite frequent and unpredictable availability changes, RLBoost adopts a hybrid architecture with three key techniques: (1) adaptive rollout offload, which dynamically adjusts workloads on the reserved (on-demand) cluster, (2) pull-based weight transfer, which quickly provisions newly available instances, and (3) token-level response collection and migration, which enables efficient preemption handling and continuous load balancing. Extensive experiments show that RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.
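To make the idea of token-level response collection and migration concrete, the following is a minimal, illustrative sketch and not RLBoost's actual implementation. It assumes a coordinator that keeps every partially generated response, so that when a preemptible instance disappears the request can resume on another instance from the tokens already collected. All names here (`Request`, `fake_decode_step`, `run_rollout`, `lifetimes`) are hypothetical, and the decode step is a stand-in for a real inference engine.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """One rollout request; `tokens` is the response collected so far."""
    prompt: str
    max_tokens: int
    tokens: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.tokens) >= self.max_tokens


def fake_decode_step(req: Request) -> str:
    """Hypothetical stand-in for one decoding step of an inference engine."""
    return f"t{len(req.tokens)}"


def run_rollout(requests, lifetimes):
    """Decode all requests across a sequence of simulated instances.

    `lifetimes[i]` is the number of decode steps the i-th preemptible
    instance survives. Once the list is exhausted, the remaining work runs
    on a reserved instance that is never preempted (an assumption of this
    sketch). Returns the finished requests and the total decode steps.
    """
    pending = deque(requests)
    finished = []
    total_steps = 0
    budgets = iter(lifetimes)

    while pending:
        budget = next(budgets, None)  # None means a reserved instance
        while pending and (budget is None or budget > 0):
            req = pending.popleft()
            while not req.done and (budget is None or budget > 0):
                # Each token is collected as soon as it is produced, so a
                # later preemption cannot lose it.
                req.tokens.append(fake_decode_step(req))
                total_steps += 1
                if budget is not None:
                    budget -= 1
            if req.done:
                finished.append(req)
            else:
                # Instance preempted mid-response: requeue the request with
                # its partial tokens intact so the next instance resumes it.
                pending.appendleft(req)

    return finished, total_steps


if __name__ == "__main__":
    reqs = [Request("a", 4), Request("b", 2)]
    finished, steps = run_rollout(reqs, lifetimes=[3, 2])
    for r in finished:
        print(r.prompt, r.tokens)
    print("decode steps:", steps)
```

In this toy model the total number of decode steps equals the total number of requested tokens, because no token is regenerated after a preemption. A real system would presumably also have to rebuild or transfer the attention cache for the migrated prefix, which this sketch ignores.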
