Reinforcement learning (RL) is a critical component of post-training for large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight-synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids the floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves a more than 100x reduction in communication volume (from 14 GB to ~108 MB) while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth needed to sustain high GPU utilization from 20 Gbit/s to 0.2 Gbit/s.
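The core idea described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names are hypothetical, and a real system would operate on GPU tensors and a wire protocol rather than NumPy arrays. Two details from the abstract are captured here: the diff is computed on raw bit patterns (so the patch is exactly lossless), and the patch is applied by direct overwrite of the changed entries rather than by adding a delta, which is what avoids floating-point drift across repeated updates.

```python
import numpy as np

def encode_patch(old: np.ndarray, new: np.ndarray):
    """Lossless sparse patch: indices and new values of changed float32 parameters.

    Comparing raw bit patterns (not floating-point equality) keeps the patch
    exact even for NaNs, infinities, or signed zeros.
    """
    changed = old.view(np.uint32) != new.view(np.uint32)  # bitwise diff
    idx = np.flatnonzero(changed)
    return idx, new.ravel()[idx]

def apply_patch(weights: np.ndarray, idx: np.ndarray, vals: np.ndarray) -> np.ndarray:
    """Overwrite the changed entries in place.

    Direct assignment (rather than `weights[idx] += delta`) reproduces the
    trainer's weights bit for bit, so there is no accumulated drift.
    """
    weights.ravel()[idx] = vals
    return weights
```

With update sparsity above 99%, `idx` and `vals` together are a small fraction of the full weight tensor, which is the source of the communication reduction reported above.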