Large-scale models rely heavily on 3D parallelism for distributed training, which uses tensor parallelism (TP) as its intra-operator parallelism to partition model states across GPUs. However, TP introduces significant communication overhead and makes it complex to adapt single-GPU code. In this paper, we propose ZeroPP, a TP-free distributed framework that combines scalable inter-operator pipeline parallelism with intra-operator fully sharded data parallelism to train models at scale, reducing memory consumption while enabling high training efficiency. Through extensive experiments, we demonstrate that ZeroPP achieves performance gains of up to 33% over conventional 3D parallelism while maintaining comparable GPU memory consumption.
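To make the hybrid layout concrete, the following minimal sketch shows one way ranks could be arranged when pipeline parallelism is combined with fully sharded data parallelism and no TP dimension is used. It is an illustration under stated assumptions, not ZeroPP's actual implementation: the function names `build_groups` and `model_state_bytes_per_gpu`, the contiguous rank ordering, the even split of layers across stages, and the 16 bytes per parameter (a common estimate for mixed-precision Adam training) are all hypothetical choices made here.

```python
from typing import List, Tuple


def build_groups(world_size: int, pp_size: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Arrange ranks as pipeline stages x FSDP shard groups (hypothetical layout)."""
    if pp_size <= 0 or world_size % pp_size != 0:
        raise ValueError("world_size must be a positive multiple of pp_size")
    fsdp_size = world_size // pp_size
    # Ranks in one FSDP group hold shards of the same pipeline stage.
    fsdp_groups = [
        list(range(stage * fsdp_size, (stage + 1) * fsdp_size))
        for stage in range(pp_size)
    ]
    # Ranks with the same shard index across stages form one pipeline.
    pp_groups = [
        [stage * fsdp_size + shard for stage in range(pp_size)]
        for shard in range(fsdp_size)
    ]
    return fsdp_groups, pp_groups


def model_state_bytes_per_gpu(
    num_params: int, pp_size: int, fsdp_size: int, bytes_per_param: int = 16
) -> float:
    """Rough per-GPU model-state footprint.

    Assumes layers split evenly across pipeline stages and every stage is
    fully sharded across its FSDP group; activations are not counted.
    """
    return num_params * bytes_per_param / (pp_size * fsdp_size)


if __name__ == "__main__":
    fsdp_groups, pp_groups = build_groups(world_size=8, pp_size=2)
    print("FSDP groups:", fsdp_groups)
    print("Pipeline groups:", pp_groups)
    print("Bytes per GPU:", model_state_bytes_per_gpu(8_000_000_000, 2, 4))
```

Under these assumptions, model states are divided by the product of the two parallel degrees, which is the same factor a TP dimension of equal total size would provide, so the sketch only illustrates the memory bookkeeping and says nothing about communication cost or throughput.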