HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity of high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within each level. Finally, striping and pipelining optimizations are applied as specified to streamline the execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17$\times$ higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.