Collective communication is becoming increasingly important in data center and supercomputer workloads as distributed AI-related jobs grow more common. However, existing libraries that provide collective support, such as NCCL, RCCL, and Cray-MPICH, exhibit several performance and scalability limitations on modern GPU supercomputers. To address these challenges, we introduce the Performant Collective Communication Library (PCCL), specifically targeted at distributed deep learning (DL) workloads. PCCL provides highly optimized implementations of the key collectives used in distributed DL: all-gather, reduce-scatter, and all-reduce. PCCL uses a hierarchical design with learning-based adaptive selection of the best-performing algorithms to scale efficiently to thousands of GPUs. It achieves substantial speedups over RCCL on 2048 GCDs of Frontier: up to 168x for reduce-scatter, 33x for all-gather, and 10x for all-reduce. More modest but still significant gains of up to 5.7x over NCCL are observed on Perlmutter. These gains translate directly into performance improvements for production DL workloads: up to 4.9x speedup over RCCL in DeepSpeed ZeRO-3 training, and up to 2.4x speedup in DDP training.
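To make the term "hierarchical design" concrete, the following is a minimal sketch of one common hierarchical all-reduce decomposition: a reduce-scatter inside each node, an all-reduce across nodes, and an all-gather inside each node. It is a single-process simulation on plain Python lists, not PCCL's implementation; the abstract does not specify how PCCL structures its hierarchy or how its learning-based selection works, so the function names, the three-step layout, and the `buffers[node][gpu]` indexing are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def split(vec: Vector, parts: int) -> List[Vector]:
    """Cut vec into `parts` equal contiguous shards."""
    size = len(vec) // parts
    return [vec[i * size:(i + 1) * size] for i in range(parts)]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def hierarchical_all_reduce(buffers: List[List[Vector]]) -> List[List[Vector]]:
    """Simulate a sum all-reduce; buffers[node][gpu] is one GPU's input.

    Hypothetical helper for illustration only (not a PCCL API).
    """
    num_gpus = len(buffers[0])
    length = len(buffers[0][0])
    if length % num_gpus != 0:
        raise ValueError("vector length must be divisible by GPUs per node")

    # Step 1: intra-node reduce-scatter.
    # Local GPU g ends up holding shard g summed over its own node.
    node_shards: List[List[Vector]] = []
    for node in buffers:
        pieces = [split(vec, num_gpus) for vec in node]
        reduced = []
        for g in range(num_gpus):
            acc = pieces[0][g]
            for gpu_pieces in pieces[1:]:
                acc = add(acc, gpu_pieces[g])
            reduced.append(acc)
        node_shards.append(reduced)

    # Step 2: inter-node all-reduce.
    # GPUs with the same local index g sum their shard across nodes.
    global_shards: List[Vector] = []
    for g in range(num_gpus):
        acc = node_shards[0][g]
        for shards in node_shards[1:]:
            acc = add(acc, shards[g])
        global_shards.append(acc)

    # Step 3: intra-node all-gather.
    # Every GPU collects all fully reduced shards into one vector.
    full = [x for shard in global_shards for x in shard]
    return [[list(full) for _ in range(num_gpus)] for _ in buffers]
```

For example, with two nodes of two GPUs each, `hierarchical_all_reduce([[[1, 2, 3, 4], [10, 20, 30, 40]], [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]])` leaves every GPU holding `[1111, 2222, 3333, 4444]`. The appeal of this layout is that only one shard per GPU (here, half of each vector) crosses the inter-node step, which is typically the slower link.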