LLM post-training with reinforcement learning (RL) requires frequent synchronization of large model parameters between the trainer and distributed rollout actors. High-throughput RL post-training therefore relies on dedicated RDMA HPC clusters, an infrastructure cost most organizations cannot absorb. A natural alternative is to aggregate loosely coupled GPUs over standard Ethernet and WAN links, but this commodity connectivity cannot sustain full-weight broadcasts: synchronizing an 8B model can take over 100~seconds on bandwidth-limited links, while rollout generation typically takes only tens of seconds. Toward making RL practical in this regime, we observe that RL fine-tuning yields highly sparse per-step updates, with only around 1\% of parameter elements changing. Building on this insight, we present SparrowRL, a high-performance RL training system designed for commodity-networked, loosely coupled GPU resources that preserves bit-exact updates without dropping or quantizing information. SparrowRL represents each step as a sparse delta checkpoint, pipelines delta extraction with multi-stream transmission, overlaps transfer with rollout generation, and coordinates heterogeneous workers through throughput- and bandwidth-aware scheduling combined with lease-based fault tolerance. On Qwen3 models from 4B to 14B deployed across up to four geographic regions, SparrowRL reduces per-step transfer payload by 79$\times$ for Qwen3-8B and improves throughput by 2.4--9.5$\times$ over full-weight broadcast across WAN, narrowing the throughput gap relative to an ideal RDMA single-datacenter baseline to within 8.91\%. By leveraging on-demand, cross-cloud GPUs over commodity links, SparrowRL delivers 1.21--1.59$\times$ higher tokens per dollar than reserved RDMA clusters at comparable throughput.
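The core idea of a bit-exact sparse delta checkpoint can be illustrated with a minimal sketch. This is not SparrowRL's actual delta format (the abstract does not specify one); the function names and the index/value encoding below are our own illustrative assumptions. The key point is that changed elements are detected by raw bit comparison and their new values are recorded verbatim, so reconstruction loses no information:

```python
import numpy as np

def extract_sparse_delta(old, new):
    """Return (indices, values) for elements that changed between steps.

    Changes are detected at the bit level, and raw new values are stored,
    so nothing is dropped or quantized. (Illustrative sketch only; not
    SparrowRL's actual on-wire format.)
    """
    flat_old, flat_new = old.ravel(), new.ravel()
    # Reinterpret float32 bits as uint32 so any bitwise change is caught.
    changed = flat_old.view(np.uint32) != flat_new.view(np.uint32)
    idx = np.flatnonzero(changed)
    return idx, flat_new[idx]

def apply_sparse_delta(old, idx, vals):
    """Reconstruct the new parameters by patching the changed elements."""
    flat = old.ravel().copy()
    flat[idx] = vals
    return flat.reshape(old.shape)

# Toy example: an RL step touches ~1% of a 1M-element float32 tensor.
rng = np.random.default_rng(0)
w_old = rng.standard_normal(1_000_000).astype(np.float32)
w_new = w_old.copy()
touched = rng.choice(w_new.size, size=w_new.size // 100, replace=False)
w_new[touched] += 1e-3

idx, vals = extract_sparse_delta(w_old, w_new)
w_rec = apply_sparse_delta(w_old, idx, vals)
assert np.array_equal(w_rec, w_new)  # bit-exact reconstruction

# Rough payload estimate, assuming int32 indices + float32 values
# (8 bytes per changed element) vs. 4 bytes/element for a full broadcast.
ratio = (4 * w_new.size) / (8 * idx.size)
print(f"changed: {idx.size / w_new.size:.2%}, reduction vs full broadcast: {ratio:.0f}x")
```

With ~1% of elements changed, the index+value encoding already shrinks the payload by roughly 50$\times$ in this toy setup; compressing the index stream (e.g. delta-encoding sorted indices) would shrink it further, which is consistent in spirit with the 79$\times$ reduction reported for Qwen3-8B.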