GPU-aware collective communication has become a major bottleneck on modern computing platforms as GPU computing power rapidly rises. A traditional approach is to integrate lossy compression directly into GPU-aware collectives, but this can cause serious problems such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate the framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that gZCCL-accelerated collectives, covering both collective computation (Allreduce) and collective data movement (Scatter), outperform NCCL and Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high quality of the data reconstructed by our accuracy-aware framework.