Training billion-parameter models requires distributing model states across GPUs with fully sharded data parallelism (i.e., ZeRO-3). While ZeRO-3 performs well on clusters with high-bandwidth NVLink and InfiniBand interconnects, researchers on commodity hardware face severe inter-node all-gather bottlenecks. Existing optimizations take two approaches: GPU memory caching (MiCS, ZeRO++) trades memory capacity for reduced communication, triggering out-of-memory failures on large models, while host memory offloading (ZeRO-Offload, ZeRO-Infinity) extends capacity but degrades throughput due to PCIe overhead. We observe that on bandwidth-limited clusters, host memory can serve not merely as an overflow tier but as a fast caching layer that outperforms inter-node communication. Based on this insight, we propose FCDP, which eliminates redundant inter-node communication while preserving ZeRO-3's minimal GPU memory footprint. FCDP caches forward-pass parameters in host memory and reuses them during the backward pass via fast intra-node all-gather, halving inter-node all-gather volume. For parameter-efficient fine-tuning (PEFT), FCDP selectively communicates only trainable parameters to maximize cache reuse, cutting inter-node traffic by over 99%. On our commodity cluster, FCDP achieves up to 100x higher throughput than ZeRO-3 and 51x higher than ZeRO++, while retaining ZeRO-3's maximum batch size.
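The caching scheme described above can be illustrated with a minimal communication-volume model. This is a hypothetical sketch, not FCDP's actual implementation: all class and function names are illustrative, and real systems would move tensors over PCIe and NCCL rather than count bytes. It shows why caching forward-pass parameters in host memory halves inter-node all-gather traffic: each layer is gathered across nodes once (forward) instead of twice (forward and backward).

```python
# Hypothetical sketch of FCDP-style forward-pass parameter caching.
# We only model communication volume; names and structure are illustrative.

class ShardedLayer:
    def __init__(self, name, nbytes):
        self.name = name        # layer identifier
        self.nbytes = nbytes    # size of the full (unsharded) parameters

class CachingRuntime:
    """Counts all-gather bytes with and without a host-memory cache."""
    def __init__(self, cache_in_host):
        self.cache_in_host = cache_in_host
        self.host_cache = {}        # layer name -> cached full parameters
        self.inter_node_bytes = 0   # traffic over the slow inter-node network
        self.intra_node_bytes = 0   # traffic over fast intra-node links

    def gather_forward(self, layer):
        # The forward pass always needs one inter-node all-gather per layer.
        self.inter_node_bytes += layer.nbytes
        if self.cache_in_host:
            # Keep the just-gathered full parameters in host memory.
            self.host_cache[layer.name] = layer

    def gather_backward(self, layer):
        if self.cache_in_host and layer.name in self.host_cache:
            # Reuse the cached parameters via fast intra-node all-gather
            # instead of a second inter-node all-gather.
            self.intra_node_bytes += layer.nbytes
            del self.host_cache[layer.name]
        else:
            self.inter_node_bytes += layer.nbytes

def run_step(runtime, layers):
    for layer in layers:              # forward pass
        runtime.gather_forward(layer)
    for layer in reversed(layers):    # backward pass
        runtime.gather_backward(layer)
    return runtime.inter_node_bytes

layers = [ShardedLayer(f"block{i}", 100) for i in range(4)]
baseline = run_step(CachingRuntime(cache_in_host=False), layers)  # 800 bytes
cached = run_step(CachingRuntime(cache_in_host=True), layers)     # 400 bytes
```

With four 100-byte layers, the ZeRO-3 baseline gathers 800 bytes across nodes per step, while caching cuts this to 400: exactly the 50% reduction claimed above. The PEFT case goes further by the same logic, since frozen parameters never change and can stay cached across steps, leaving only trainable parameters on the inter-node path.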