Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction among rollout generation, reward evaluation, and centralized learning. Distributing rollout execution makes it possible to use more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training in settings with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, which allows rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while achieving RL reward comparable to strong baselines.
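To make the idea of a provisioning rule concrete, the following is a minimal back-of-the-envelope sketch, not the capacity model from the paper, whose exact form is not given in this abstract. It assumes a simplified steady state in which each learner step consumes a fixed batch of rollouts, every worker produces rollouts at the same constant rate, and dissemination latency is fully hidden once the staleness bound spans enough training steps. The function names and parameters (`min_workers`, `min_staleness`, `batch_rollouts`, `train_time_s`, `rollouts_per_worker_s`, `bcast_latency_s`) are hypothetical and introduced only for illustration.

```python
import math


def min_workers(batch_rollouts: int, train_time_s: float,
                rollouts_per_worker_s: float) -> int:
    """Smallest worker count whose aggregate throughput keeps the learner busy.

    Assumption (illustrative only): the learner needs `batch_rollouts` fresh
    rollouts every `train_time_s` seconds, and all workers are identical.
    """
    required_rate = batch_rollouts / train_time_s  # rollouts per second
    return math.ceil(required_rate / rollouts_per_worker_s)


def min_staleness(bcast_latency_s: float, train_time_s: float) -> int:
    """Smallest staleness bound (in policy versions) that hides dissemination.

    Assumption (illustrative only): a new policy takes `bcast_latency_s`
    seconds to reach workers, so rollouts lag by that many training steps,
    rounded up.
    """
    return math.ceil(bcast_latency_s / train_time_s)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(min_workers(batch_rollouts=512, train_time_s=64.0,
                      rollouts_per_worker_s=0.5))
    print(min_staleness(bcast_latency_s=90.0, train_time_s=64.0))
```

Under these assumptions, slower dissemination raises the staleness bound needed to keep the learner from stalling, while the worker count is set by rollout throughput relative to training time. A real deployment with heterogeneous workers would replace the single per-worker rate with a sum over worker rates.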