An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

Ziteng Chen, Xiaohe Hu, Menghao Zhang, Yanmin Jia, Yan Zhang, Mingjun Zhang, Da Liu, Fangzheng Jiao, Jun Chen, He Liu, Aohan Zeng, Shuaixing Duan, Ruya Gu, Yang Jing, Bowen Han, Jiahao Cao, Wei Chen, Wenqi Xie, Jinlong Hou, Yuan Cheng, Bohua Xu, Mingwei Xu, Chunming Hu

From arXiv; 15 pages, 16 figures

Large-scale LLM training requires collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we encounter several challenges when using NCCL in production, including 1) limited efficiency due to costly and cumbersome P2P communication, 2) poor tolerance to frequent RNIC port failures, and 3) insufficient observability of transient collective communication anomalies. To address these issues, we propose ICCL, an efficient, reliable, and observable collective communication library for large-scale GPU training clusters. ICCL offloads P2P communication from GPU kernels to CPU threads to minimize SM consumption, and removes redundant memory copies that are irrelevant to the actual communication. ICCL also introduces a primary-backup QP mechanism to tolerate frequent NIC port failures, and designs a window-based monitor to observe network anomalies at microsecond (O(µs)) granularity. We have open-sourced ICCL and deployed it in production training clusters for several months; results show that, compared to NCCL, ICCL achieves a 23.4%/28.5% improvement in P2P throughput/latency as well as a 6.02% increase in training throughput. We also share our experience operating ICCL in large-scale clusters, hoping to give the community more insight into production-level collective communication libraries for LLM training.
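
The abstract names a window-based monitor but does not describe its internals. As a rough illustration of the general idea only, the sketch below keeps a sliding time window of completed transfers and flags a window whose throughput falls below a threshold. The class name `WindowMonitor`, its parameters, and the threshold rule are all hypothetical and are not taken from ICCL.

```python
from collections import deque


class WindowMonitor:
    """Hypothetical sliding-window throughput monitor (not ICCL's actual design)."""

    def __init__(self, window_us, min_bytes_per_us):
        self.window_us = window_us                # window length in microseconds
        self.min_bytes_per_us = min_bytes_per_us  # assumed anomaly threshold
        self.events = deque()                     # (timestamp_us, nbytes), oldest first
        self.bytes_in_window = 0

    def _evict(self, now_us):
        # Drop events that fall outside the window (now - window, now].
        cutoff = now_us - self.window_us
        while self.events and self.events[0][0] <= cutoff:
            _, old_bytes = self.events.popleft()
            self.bytes_in_window -= old_bytes

    def record(self, timestamp_us, nbytes):
        # Register one completed transfer observed at timestamp_us.
        self.events.append((timestamp_us, nbytes))
        self.bytes_in_window += nbytes
        self._evict(timestamp_us)

    def throughput(self, now_us):
        # Average bytes per microsecond over the window ending at now_us.
        self._evict(now_us)
        return self.bytes_in_window / self.window_us

    def is_anomalous(self, now_us):
        return self.throughput(now_us) < self.min_bytes_per_us


monitor = WindowMonitor(window_us=100, min_bytes_per_us=50.0)
for t in range(0, 100, 10):
    monitor.record(t, 1000)          # steady traffic: 1000 bytes every 10 us
print(monitor.is_anomalous(90))      # traffic is healthy inside the window
print(monitor.is_anomalous(250))     # no completions for >100 us: flagged
```

Passing the current time to `throughput` rather than relying on the last recorded event matters here: a stalled link produces no new events, so the window has to be aged by the clock for the stall to become visible.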
