Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either lack the flexibility needed for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS serves reads directly from those workers. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Our evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight updates for elastic rollouts by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.
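To make the ROS idea concrete, the following is a minimal in-memory sketch, not TensorHub's actual interface. It assumes a hypothetical `ReferenceStore` that records which workers hold which weight version and forwards reads to one of them; the names `Worker`, `publish`, `retire`, and `fetch` are illustrative assumptions, and a plain Python dictionary stands in for GPU memory and RDMA reads. Topology-aware source selection, consistency, and fault tolerance are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical rollout worker holding one weight version (simulated GPU memory)."""
    name: str
    version: int
    weights: dict[str, list[float]]

    def read(self, tensor_name: str) -> list[float]:
        # Stand-in for a direct read from this worker's GPU memory (e.g. over RDMA).
        return self.weights[tensor_name]


@dataclass
class ReferenceStore:
    """Tracks which workers hold which version; keeps no weight bytes itself."""
    holders: dict[int, list[Worker]] = field(default_factory=dict)

    def publish(self, worker: Worker) -> None:
        # Record a reference to a worker that already holds this version.
        self.holders.setdefault(worker.version, []).append(worker)

    def retire(self, worker: Worker) -> None:
        # Drop the reference; a version with no holders is no longer readable.
        remaining = [w for w in self.holders.get(worker.version, []) if w is not worker]
        if remaining:
            self.holders[worker.version] = remaining
        else:
            self.holders.pop(worker.version, None)

    def fetch(self, version: int, tensor_name: str) -> list[float]:
        # Serve the read from a live holder. This sketch picks the first one;
        # a real system would choose a source based on topology and load.
        workers = self.holders.get(version)
        if not workers:
            raise KeyError(f"no live worker holds version {version}")
        return workers[0].read(tensor_name)


# Usage: a newly added worker pulls version 7 from an existing holder.
store = ReferenceStore()
existing = Worker("rollout-0", version=7, weights={"layer0.w": [0.1, 0.2]})
store.publish(existing)
pulled = store.fetch(7, "layer0.w")
```

In this sketch the store only ever holds references, so `fetch` returns data that lives on the worker rather than in the store, which mirrors the "illusion of storage" described above.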