GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

Today's graphics processing unit (GPU) applications produce vast volumes of data that are challenging to store and transfer efficiently. Data compression is therefore becoming a critical technique for mitigating storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 indices and floating-point numbers), while current LZSS compressors only take single-byte data as input. To this end, we propose GPULZ, a highly efficient LZSS compressor for multi-byte data on modern GPUs. The contribution of our work is fourfold. First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Second, we propose two main algorithm-level optimizations: we (1) change the prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize the existing pattern-matching approach for multi-byte symbols to reduce computational complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate GPULZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that GPULZ achieves up to 272.1X speedup on A4000 and up to 1.4X higher compression ratio compared to state-of-the-art solutions.
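To make the idea of matching on multi-byte symbols concrete, the following is a minimal sequential sketch in Python. It is not the GPULZ implementation and does not model the GPU kernels, the prefix sum, or the output encoding. It only illustrates how a greedy LZSS matcher could compare whole symbols (for example, 4-byte int32 values) instead of single bytes. The function names, the token representation, and the parameters symbol_size, window, and min_match are assumptions introduced for illustration.

```python
import struct
from typing import List, Tuple, Union

# A token is either a literal symbol (bytes) or a back-reference
# (offset, length), both counted in symbols rather than bytes.
Token = Union[bytes, Tuple[int, int]]


def split_symbols(data: bytes, symbol_size: int) -> List[bytes]:
    """Split a byte string into fixed-width symbols (e.g. 4 bytes for int32)."""
    if len(data) % symbol_size != 0:
        raise ValueError("data length must be a multiple of symbol_size")
    return [data[i:i + symbol_size] for i in range(0, len(data), symbol_size)]


def lzss_encode(data: bytes, symbol_size: int = 1, window: int = 32,
                min_match: int = 2) -> List[Token]:
    """Greedy LZSS over symbols (hypothetical helper, sequential CPU code)."""
    syms = split_symbols(data, symbol_size)
    tokens: List[Token] = []
    pos = 0
    while pos < len(syms):
        best_len, best_off = 0, 0
        # Search the sliding window for the longest match starting at pos.
        for start in range(max(0, pos - window), pos):
            length = 0
            # A match may run past pos and overlap the text being encoded.
            while (pos + length < len(syms)
                   and syms[start + length] == syms[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            pos += best_len
        else:
            tokens.append(syms[pos])
            pos += 1
    return tokens


def lzss_decode(tokens: List[Token]) -> bytes:
    """Rebuild the original bytes from literals and back-references."""
    out: List[bytes] = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length = tok
            # Copy one symbol at a time so overlapping matches work.
            for _ in range(length):
                out.append(out[-off])
        else:
            out.append(tok)
    return b"".join(out)


# Eight little-endian int32 values: the pattern 1, 2, 3, 4 appears twice.
data = struct.pack("<8i", 1, 2, 3, 4, 1, 2, 3, 4)
tokens_int32 = lzss_encode(data, symbol_size=4)
tokens_byte = lzss_encode(data, symbol_size=1)
```

With symbol_size=4, the second occurrence of the pattern is covered by a single back-reference of four symbols, and each candidate position costs one comparison per int32 value rather than one per byte. Note that window is measured in symbols in this sketch, so the same window setting reaches four times farther back in bytes when symbol_size is 4.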
