With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communication has become a critical bottleneck in large-scale distributed and parallel processing. Large message sizes in MPI collectives are a particular concern because they can significantly degrade overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors to MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called C-Coll, which leverages error-bounded lossy compression to significantly reduce message size, resulting in a substantial reduction in communication cost. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks, one for each type of MPI collective (collective data movement and collective computation), tailored to their particular characteristics. Both frameworks reduce communication cost while preserving data accuracy. (2) We customize an optimized version of SZx, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate C-Coll into multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and perform a comprehensive evaluation on real-world scientific datasets. Experiments show that our solution outperforms the original MPI collectives, as well as multiple baselines and related efforts, by 3.5-9.7×.
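To make the notion of an error bound concrete, the following is a minimal sketch of how a pointwise absolute error bound behaves in a sum-type collective computation. It is not C-Coll and does not use SZx: the functions `compress`, `decompress`, and `simulated_allreduce_sum` are hypothetical names, the compressor is a plain uniform quantizer, and the "ranks" are ordinary Python lists rather than MPI processes. The only property it illustrates is that if each rank's buffer is reconstructed within `eb`, the summed result over `ranks` buffers deviates from the exact sum by at most `ranks * eb`.

```python
import random

def compress(values, eb):
    # Toy stand-in for an error-bounded compressor (assumption, not SZx):
    # map each value to the index of a bin of width 2*eb.
    return [round(v / (2.0 * eb)) for v in values]

def decompress(codes, eb):
    # Reconstruct each value as the centre of its bin, so the pointwise
    # error is at most eb (up to floating-point rounding).
    return [c * 2.0 * eb for c in codes]

def simulated_allreduce_sum(rank_buffers, eb):
    # Each simulated rank sends a compressed buffer; the receiver
    # decompresses every buffer and accumulates the element-wise sum.
    total = [0.0] * len(rank_buffers[0])
    for buf in rank_buffers:
        recon = decompress(compress(buf, eb), eb)
        for i, v in enumerate(recon):
            total[i] += v
    return total

random.seed(0)
eb = 1e-3      # absolute error bound per value (illustrative choice)
ranks = 4      # number of simulated ranks (illustrative choice)
buffers = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(ranks)]

exact = [sum(column) for column in zip(*buffers)]
approx = simulated_allreduce_sum(buffers, eb)
max_err = max(abs(a - b) for a, b in zip(exact, approx))

print("max error of reduced result:", max_err)
print("worst-case bound (ranks * eb):", ranks * eb)
```

A fixed-rate compressor, by contrast, fixes the output size rather than the reconstruction error, so no comparable bound on the reduced result can be stated in advance.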