Large-scale LLM training relies on collective communication libraries to exchange data among distributed GPUs. As a company dedicated to building and operating large-scale GPU training clusters, we have encountered several practical limitations of NCCL in production: 1) competition for streaming multiprocessors (SMs) between computation and communication, 2) expensive restarts after link failures, and 3) insufficient observability of transient collective communication anomalies. To address these challenges, we propose VCCL, an efficient, reliable, and observable collective communication library for large-scale GPU training clusters. VCCL removes SM-consuming P2P kernels by moving intra-node data movement and stream dependency enforcement to CPU threads and GPU copy engines. VCCL also introduces a primary-backup QP mechanism to tolerate frequent NIC port failures, and a window-based monitor that observes network anomalies at O(μs) granularity. We have open-sourced VCCL and deployed it in production training clusters for several months. Compared with NCCL, VCCL improves training throughput by up to 5.28% and substantially reduces wasted GPU resources through runtime fault tolerance and fine-grained monitoring. We also share the experience and lessons learned from deploying VCCL in large-scale clusters.
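To make the idea of a window-based monitor concrete, the following is a minimal sketch of a generic sliding-window throughput monitor. It is not VCCL's implementation: the class name `WindowMonitor`, its methods, the microsecond timestamps, and the bytes-per-microsecond threshold are all assumptions chosen for illustration, since the abstract does not describe the actual design.

```python
from collections import deque


class WindowMonitor:
    """Sliding-window throughput monitor (hypothetical sketch, not VCCL's code)."""

    def __init__(self, window_us: int):
        self.window_us = window_us      # window length in microseconds (assumed unit)
        self.events = deque()           # (timestamp_us, nbytes), oldest first
        self.bytes_in_window = 0        # running sum of bytes inside the window

    def record(self, timestamp_us: int, nbytes: int) -> None:
        """Register a completed transfer of nbytes at timestamp_us."""
        self.events.append((timestamp_us, nbytes))
        self.bytes_in_window += nbytes
        self._evict(timestamp_us)

    def _evict(self, now_us: int) -> None:
        """Drop events that fell out of the window (now - window, now]."""
        cutoff = now_us - self.window_us
        while self.events and self.events[0][0] <= cutoff:
            _, old_bytes = self.events.popleft()
            self.bytes_in_window -= old_bytes

    def throughput(self, now_us: int) -> float:
        """Average bytes per microsecond over the most recent window."""
        self._evict(now_us)
        return self.bytes_in_window / self.window_us

    def is_anomalous(self, now_us: int, min_bytes_per_us: float) -> bool:
        """Flag the window if throughput drops below an assumed threshold."""
        return self.throughput(now_us) < min_bytes_per_us
```

Because the running byte count is updated on insert and evict, each query costs amortized constant time, which is the property that makes short windows cheap to evaluate frequently. How a production monitor chooses its window length and threshold depends on the link speed and workload, and is not specified here.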