FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs

Today's large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Data compression is therefore becoming a critical technique for mitigating the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and high throughput simultaneously, which hinders their adoption in applications that require fast compression, such as in-memory compression. To address this, we develop FZ-GPU, a fast and high-ratio error-bounded lossy compressor for scientific data on GPUs. Specifically, we first design a new compression pipeline that consists of fully parallelized quantization, bitshuffle, and our newly designed fast encoding. We then propose a series of deep architectural optimizations for each kernel in the pipeline to take full advantage of CUDA architectures: a warp-level optimization that avoids data conflicts for bit-wise operations in bitshuffle, maximized shared-memory utilization, and fusion of different compression kernels to eliminate unnecessary data movement. Finally, we evaluate FZ-GPU on two NVIDIA GPUs (A100 and RTX A4000) using six representative scientific datasets from SDRBench. Results on the A100 GPU show that, under the same error bound, FZ-GPU achieves an average speedup of 4.2× over cuSZ and an average speedup of 37.0× over a multi-threaded CPU implementation of our algorithm. Under the same data distortion, FZ-GPU also achieves an average speedup of 2.3× and an average compression ratio improvement of 2.0× over cuZFP.
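
The abstract names the three pipeline stages (quantization, bitshuffle, fast encoding) without giving their internals. The following is a minimal, sequential CPU sketch of how such a pipeline could fit together; it is not the FZ-GPU kernels. Several pieces are assumptions made for illustration only: the delta and zigzag step after quantization, the 16-bit code width, the use of zero-bit-plane removal as a stand-in for the paper's fast encoding, and all function names.

```python
import math

WIDTH = 16  # assumed bit width of each quantization symbol


def quantize(data, eb):
    """Error-bounded quantization: |x - 2*eb*code| <= eb for every x."""
    return [int(math.floor(x / (2 * eb) + 0.5)) for x in data]


def zigzag(n):
    """Map signed ints to unsigned so small magnitudes use few bits."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    return z >> 1 if z % 2 == 0 else -((z + 1) >> 1)


def bitshuffle(values, width=WIDTH):
    """Transpose bits: plane b collects bit b of every value."""
    planes = []
    for b in range(width):
        plane = 0
        for i, v in enumerate(values):
            plane |= ((v >> b) & 1) << i
        planes.append(plane)
    return planes


def bitunshuffle(planes, n):
    """Inverse of bitshuffle for n original values."""
    values = [0] * n
    for b, plane in enumerate(planes):
        for i in range(n):
            values[i] |= ((plane >> i) & 1) << b
    return values


def encode(planes):
    """Stand-in encoder (assumption): drop all-zero planes, keep a flag per plane."""
    flags = [p != 0 for p in planes]
    return flags, [p for p in planes if p != 0]


def decode(flags, payload):
    it = iter(payload)
    return [next(it) if f else 0 for f in flags]


def compress(data, eb):
    codes = quantize(data, eb)
    # Assumed prediction step: store differences between neighboring codes.
    deltas = [c - p for p, c in zip([0] + codes, codes)]
    symbols = [zigzag(d) for d in deltas]
    if any(s >> WIDTH for s in symbols):
        raise ValueError("symbol does not fit in WIDTH bits")
    return encode(bitshuffle(symbols)), len(data)


def decompress(flags, payload, n, eb):
    symbols = bitunshuffle(decode(flags, payload), n)
    codes, prev = [], 0
    for s in symbols:
        prev += unzigzag(s)  # undo the delta step
        codes.append(prev)
    return [2 * eb * c for c in codes]
```

For smooth input, neighboring codes differ by small amounts, so after bitshuffle the high-order bit planes are entirely zero and the stand-in encoder discards them. Calling `compress(data, eb)` returns `((flags, payload), n)`, and `decompress(flags, payload, n, eb)` reconstructs values within `eb` of the originals, up to floating-point rounding. On a GPU, each stage would be parallelized across threads; this sketch shows only the data flow.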
