Unleashing Efficient Asynchronous RL Post-Training via Staleness-Constrained Rollout Coordination

Reinforcement learning (RL) post-training has become pivotal for enhancing the capabilities of modern large models. A recent trend is to develop RL systems with a fully disaggregated architecture, which decouples the three RL phases (rollout, reward, and training) onto separate resources and executes them asynchronously. However, two critical data-level concerns arise: (1) asynchronous execution leads to data staleness in trajectories (the data generated by rollout) as the model parameters used in rollout may not be up to date, which impairs RL convergence; and (2) the length variation of trajectories introduces severe data skewness, leading to workload imbalance and degraded system performance. Existing systems fail to address these two concerns in a unified manner. Techniques that tightly control data staleness often constrain effective data skewness mitigation, while aggressive data skewness mitigation tends to exacerbate data staleness. As a result, systems are forced to trade off convergence for performance, or vice versa. To address this, we propose StaleFlow, an RL post-training system that jointly tackles data staleness and skewness. First, to control staleness, StaleFlow introduces a global consistency protocol that tracks the full lifecycle of each trajectory and constrains staleness. Second, to mitigate skewness, StaleFlow re-designs the RL system architecture by constructing data servers for trajectories and parameters to achieve flexible rollout coordination. Subsequently, we develop a suite of staleness-aware, throughput-oriented strategies to enhance system performance. Evaluations show that StaleFlow achieves up to 1.42-2.68$\times$ (1.17-2.01$\times$ on average) higher throughput than state-of-the-art systems, without compromising convergence.
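To make the notion of bounded staleness concrete, the following is a minimal sketch and not StaleFlow's actual protocol, which the abstract does not specify. It assumes staleness is measured as the number of parameter versions by which the trainer is ahead of the parameters that generated a trajectory. The names `Trajectory`, `policy_version`, `staleness`, `admit`, and `max_staleness` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """A rollout sample tagged with the parameter version that generated it."""
    prompt_id: int
    policy_version: int  # hypothetical field: version of the rollout parameters


def staleness(traj, trainer_version):
    """Assumed definition: how many parameter versions the trainer is ahead."""
    return trainer_version - traj.policy_version


def admit(trajs, trainer_version, max_staleness):
    """Split trajectories into those within the staleness bound and those beyond it."""
    fresh = [t for t in trajs if staleness(t, trainer_version) <= max_staleness]
    stale = [t for t in trajs if staleness(t, trainer_version) > max_staleness]
    return fresh, stale
```

Under this assumed definition, a bound of `max_staleness = 0` corresponds to fully synchronous (on-policy) training, while larger bounds admit more asynchrony at the cost of older data.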
