CO2: Efficient Distributed Training with Full Communication-Computation Overlap

The success of large language models depends on effective large-scale distributed training. However, building a large, high-performance cluster with high-speed communication interconnects is prohibitively costly and feasible only for prominent entities. In this work, we aim to lower this barrier and democratize large-scale training on clusters with limited bandwidth. We propose a new approach called CO2 that introduces local updating and asynchronous communication into distributed data-parallel training, thereby enabling full overlap of COmmunication with COmputation. CO2 attains high scalability even on large multi-node clusters constrained by very limited communication bandwidth. We further propose staleness gap penalty and outer momentum clipping techniques for CO2 to improve its convergence and training stability. In addition, CO2 integrates seamlessly with the well-established ZeRO-series optimizers, which reduce the memory consumption of model states in large-model training. We also provide a mathematical proof of convergence, together with a strict upper bound. Furthermore, we validate our findings through extensive experiments covering a wide range of tasks in computer vision and natural language processing. These experiments demonstrate the convergence, generalization, and scalability of CO2 on configurations of up to 128 A100 GPUs. The results show that CO2 greatly improves scalability, whether on clusters with 800Gbps RDMA or 80Gbps TCP/IP inter-node connections.
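To make the idea of local updating with asynchronous communication concrete, the following is a minimal single-process sketch, not the CO2 algorithm itself. It assumes a toy setting in which each worker minimizes a scalar quadratic, and it models the overlap by applying each averaged update one round late, as if an all-reduce were running in the background during the next block of local steps. The names `local_steps`, `simulate`, and `in_flight` are hypothetical, and the staleness gap penalty, outer momentum clipping, and ZeRO integration are not modeled.

```python
def local_steps(x, target, lr, steps):
    """Run `steps` plain gradient steps on f(x) = 0.5 * (x - target) ** 2."""
    for _ in range(steps):
        x -= lr * (x - target)
    return x


def simulate(targets, rounds=100, steps=4, lr=0.1):
    """Toy data-parallel loop with one-round-stale averaging (an assumption).

    The averaged update computed in round t is applied only at the end of
    round t + 1, standing in for a communication step that overlaps with
    the next block of local computation.
    """
    x = 0.0          # shared model, a single scalar in this toy example
    in_flight = 0.0  # averaged update that is still "being communicated"
    for _ in range(rounds):
        # Each worker starts from the shared model and updates locally.
        deltas = [local_steps(x, t, lr, steps) - x for t in targets]
        avg_delta = sum(deltas) / len(deltas)
        # Apply the update whose communication has just finished
        # (it was computed one round earlier, so it is stale by one round).
        x += in_flight
        # Launch "communication" of the freshly computed update.
        in_flight = avg_delta
    return x


if __name__ == "__main__":
    # Four hypothetical workers, each with its own local optimum.
    print(simulate([1.0, 2.0, 3.0, 6.0]))
```

In this toy setting the shared value approaches the mean of the workers' targets even though every applied update is one round old, which illustrates why stale averaging can still converge when the effective step size is small enough.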
