Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu

The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to two main issues. First, hardware failures are inevitable and interrupt training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Second, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase GPU waiting time. To address these challenges, this paper introduces a communication-driven solution, namely C4. The key insights of C4 are twofold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
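
The first insight, that homogeneous collective communication makes anomalies stand out, can be illustrated with a minimal sketch. This is not C4's actual detection logic, which the abstract does not describe. The function name `find_slow_ranks`, the `tolerance` parameter, the median baseline, and the sample timings are all assumptions made for illustration: each rank is assumed to report how long the same collective operation took, and any rank far above the median is flagged as a candidate for isolation.

```python
from statistics import median


def find_slow_ranks(durations_ms, tolerance=1.5):
    """Return the ranks whose duration exceeds tolerance x the median.

    durations_ms maps a rank id to the time (in milliseconds) that rank
    spent in the same collective operation. Because the workload is
    assumed to be homogeneous, healthy ranks should cluster near the median.
    """
    baseline = median(durations_ms.values())
    return sorted(
        rank for rank, duration in durations_ms.items()
        if duration > tolerance * baseline
    )


# Hypothetical timings for one collective across four ranks.
sample = {0: 10.1, 1: 9.9, 2: 10.0, 3: 31.5}
slow = find_slow_ranks(sample)
print(slow)
```

With these made-up timings only rank 3 is flagged. A median baseline is used here because a single slow rank barely shifts it, whereas a mean would be pulled toward the outlier.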
