Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at scales that force the per-GPU batch size to be small, ZeRO's effective throughput is limited by the high communication volume of gathering weights in the forward pass, gathering weights in the backward pass, and averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, each targeting one of the communication collectives in ZeRO. The first is a block-quantization based all-gather. The second is a data remapping that trades additional memory for reduced communication. The third is a novel all-to-all based quantized gradient averaging paradigm that replaces the reduce-scatter collective and preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at a scale of 384 GPUs.
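To make the idea behind the block-quantization based all-gather concrete, the following is a minimal sketch of block-wise quantization in plain Python. It is not the ZeRO++ implementation: the symmetric INT8 scheme, the block size of 4, and the names `quantize_blocks` and `dequantize_blocks` are assumptions chosen for illustration. The point it shows is that giving each block its own scale keeps small-magnitude weights representable even when another block contains large values.

```python
def quantize_blocks(values, block_size=4, bits=8):
    """Symmetric quantization with one scale per block (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit signed integers
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Guard against an all-zero block, where any scale works.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blocks(codes, scales, block_size=4):
    """Map integer codes back to floats using the scale of their block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical weights: one block of small values, one block of large values.
weights = [0.01, -0.02, 0.015, 0.005, 8.0, -4.0, 2.0, 1.0]

# Per-block scales: the small block keeps its own resolution.
codes, scales = quantize_blocks(weights, block_size=4)
restored = dequantize_blocks(codes, scales, block_size=4)

# A single scale for the whole tensor, for comparison.
global_codes, global_scales = quantize_blocks(weights, block_size=8)
```

In this sketch, the single-scale variant rounds every small weight in the first half of `weights` to zero, while the per-block variant keeps them distinguishable. What would be communicated is the integer codes plus one scale per block, rather than the full-precision values.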