GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose gZCCL, a general framework for designing and optimizing GPU-aware, compression-enabled collectives, with an accuracy-aware design that controls error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that gZCCL-accelerated collectives, covering both collective computation (Allreduce) and collective data movement (Scatter), outperform NCCL and Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high quality of the data reconstructed by our accuracy-aware framework.