Disaggregating the generation and training stages in RL is widely adopted to scale LLM post-training. This paradigm faces two critical challenges. First, the generation stage often becomes a bottleneck due to dynamic workload shifts and severe execution imbalances. Second, the decoupled stages produce diverse and dynamic network traffic patterns that strain conventional static fabrics. We build OrchestrRL to dynamically orchestrate both compute and network resources in disaggregated RL. OrchestrRL employs an adaptive compute scheduler that adjusts the parallelism configuration to match changing workload characteristics within and across generation steps. OrchestrRL also adopts a reconfigurable optical-electrical fabric called RFabric: it leverages optical circuit switches to reconfigure the aggregation and core layers of the topology on demand, tailoring bandwidth resources to the distinct communication patterns of training, generation, and weight synchronization. Evaluated on a 64-H800 GPU testbed, OrchestrRL demonstrates up to a 1.42x throughput improvement over static baselines. Using a high-fidelity simulator, we further show that RFabric achieves superior performance-cost efficiency at scale over static Fat-Tree networks.
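To illustrate the kind of decision an adaptive compute scheduler makes, here is a minimal sketch (not OrchestrRL's actual algorithm) that picks a data-parallel/tensor-parallel split for a fixed GPU pool based on the number of in-flight generation requests. The cost model, the sublinear tensor-parallel speedup exponent, and the function name are all illustrative assumptions.

```python
def choose_parallelism(num_gpus, active_requests, tp_options=(1, 2, 4, 8)):
    """Return (dp, tp) for a fixed GPU pool.

    Hypothetical heuristic: many in-flight requests favor more
    data-parallel replicas (throughput); a few long-tail requests
    favor larger tensor parallelism (per-request latency).
    """
    best = None
    for tp in tp_options:
        if num_gpus % tp:  # skip splits that don't divide the pool
            continue
        dp = num_gpus // tp
        # Batches of work: each replica serves dp requests per round.
        rounds = -(-active_requests // dp)  # ceil division
        # Assumed cost model: tensor parallelism speeds up a round
        # sublinearly (tp ** 0.7), reflecting communication overhead.
        est_time = rounds * tp ** -0.7
        if best is None or est_time < best[0]:
            best = (est_time, dp, tp)
    return best[1], best[2]
```

Under this toy model, a large backlog selects pure data parallelism, while a handful of straggler requests selects the widest tensor-parallel degree, mirroring the workload shift the abstract describes between the bulk and tail of a generation step.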