The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous GPUs, leading to inefficiency and higher costs. We present HetCCL, a collective communication library that unifies vendor-specific backends and enables RDMA-based communication across GPUs without requiring driver modifications. HetCCL introduces two novel mechanisms that enable cross-vendor communication while still leveraging the vendor-optimized libraries NVIDIA NCCL and AMD RCCL. Evaluations on a multi-vendor GPU cluster show that HetCCL matches NCCL and RCCL performance in homogeneous setups and, unlike either library alone, scales in heterogeneous environments, enabling practical, high-performance training across NVIDIA and AMD GPUs without changes to existing deep learning applications.