Today's large-scale scientific applications running on high-performance computing (HPC) systems generate vast volumes of data, making data compression a critical technique for reducing storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and high throughput simultaneously, which hinders their adoption in applications that require fast compression, such as in-memory compression. To this end, we develop FZ-GPU, a fast and high-ratio error-bounded lossy compressor on GPUs for scientific data. Specifically, we first design a new compression pipeline that consists of fully parallelized quantization, bitshuffle, and our newly designed fast encoding. Then, we propose a series of deep architectural optimizations for each kernel in the pipeline to take full advantage of CUDA architectures: a warp-level optimization avoids data conflicts for bit-wise operations in bitshuffle, shared-memory utilization is maximized, and different compression kernels are fused to eliminate unnecessary data movement. Finally, we evaluate FZ-GPU on two NVIDIA GPUs (A100 and RTX A4000) using six representative scientific datasets from SDRBench. Results on the A100 GPU show that FZ-GPU achieves an average speedup of 4.2X over cuSZ and 37.0X over a multi-threaded CPU implementation of our algorithm under the same error bound. FZ-GPU also achieves an average speedup of 2.3X and an average compression-ratio improvement of 2.0X over cuZFP under the same data distortion.
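To make the first two pipeline stages concrete, the following is a minimal serial sketch in Python of error-bounded quantization followed by bitshuffle. It is an illustration under stated assumptions, not the FZ-GPU kernels: it uses plain uniform quantization with a step of twice the error bound, a zigzag mapping to obtain non-negative codes, and a sequential bit-plane transpose. All function names and the 16-bit code width are hypothetical choices made for this example, and the fast encoding stage is not reproduced here.

```python
# Illustrative sketch only: serial Python, not the GPU kernels described above.
# Assumptions: uniform quantization (no prediction step), zigzag sign mapping,
# and a hypothetical 16-bit code width.

def quantize(data, error_bound):
    """Map each value to an integer code with reconstruction error <= error_bound."""
    step = 2.0 * error_bound
    return [int(round(x / step)) for x in data]

def dequantize(codes, error_bound):
    """Reconstruct approximate values from integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

def zigzag(c):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * c if c >= 0 else -2 * c - 1

def unzigzag(u):
    """Inverse of zigzag."""
    return (u >> 1) ^ -(u & 1)

def bitshuffle(codes, width=16):
    """Bit-plane transpose: plane b collects bit b of every code (element i -> bit i)."""
    planes = []
    for b in range(width):
        plane = 0
        for i, c in enumerate(codes):
            if c < 0 or c >> width:
                raise ValueError("code does not fit in the chosen width")
            plane |= ((c >> b) & 1) << i
        planes.append(plane)
    return planes

def bitunshuffle(planes, count):
    """Inverse of bitshuffle for `count` codes."""
    codes = [0] * count
    for b, plane in enumerate(planes):
        for i in range(count):
            codes[i] |= ((plane >> i) & 1) << b
    return codes

def compress_sketch(data, error_bound, width=16):
    """Quantize, zigzag, and bitshuffle a list of floats."""
    return bitshuffle([zigzag(c) for c in quantize(data, error_bound)], width)

def decompress_sketch(planes, count, error_bound):
    """Undo compress_sketch up to the error bound."""
    codes = [unzigzag(u) for u in bitunshuffle(planes, count)]
    return dequantize(codes, error_bound)

# Usage: smooth data yields small codes, so the high bit planes are all zero.
sample = [0.10, 0.11, 0.12, 0.10, 0.09]
planes = compress_sketch(sample, error_bound=0.01)
restored = decompress_sketch(planes, len(sample), error_bound=0.01)
zero_planes = sum(1 for p in planes if p == 0)
```

In this sketch, small quantization codes leave the upper bit planes entirely zero after the transpose, which is the kind of redundancy a subsequent encoding stage can exploit. A GPU version would distribute the transpose across threads instead of looping serially.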