GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

Today's graphics processing unit (GPU) applications produce vast volumes of data, which are challenging to store and transfer efficiently. Data compression is therefore becoming a critical technique for mitigating the storage burden and communication cost. LZSS is the core algorithm in many widely used compressors, such as Deflate. However, existing GPU-based LZSS compressors suffer from low throughput due to the sequential nature of the LZSS algorithm. Moreover, many GPU applications produce multi-byte data (e.g., int16/int32 indices, floating-point numbers), while current LZSS compressors accept only single-byte data as input. To this end, we propose GPULZ, a highly efficient LZSS compressor for multi-byte data on modern GPUs. The contribution of our work is fourfold. First, we perform an in-depth analysis of existing LZ compressors for GPUs and investigate their main issues. Second, we propose two main algorithm-level optimizations. Specifically, we (1) change the prefix sum from one pass to two passes and fuse multiple kernels to reduce data movement between shared memory and global memory, and (2) optimize the existing pattern-matching approach for multi-byte symbols to reduce computational complexity and explore longer repeated patterns. Third, we perform architectural performance optimizations, such as maximizing shared memory utilization by adapting data partitions to different GPU architectures. Finally, we evaluate GPULZ on six datasets of various types with NVIDIA A100 and A4000 GPUs. Results show that GPULZ achieves up to 272.1X speedup on A4000 and up to 1.4X higher compression ratio compared to state-of-the-art solutions.
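To make the two algorithm-level ideas concrete, the following is a minimal sequential CPU sketch in Python, not the GPULZ implementation. It illustrates (a) LZSS matching over fixed-width multi-byte symbols rather than single bytes and (b) one common way to organize an exclusive prefix sum in two passes (a local scan per block, then a scan of the block totals). The function names, the token representation, the window size, and the block size are all assumptions chosen for illustration; the abstract does not specify how GPULZ structures its scan or its matcher.

```python
from typing import List, Tuple, Union

# Hypothetical token format: an (offset, length) match or a literal symbol.
Token = Union[Tuple[int, int], int]


def to_symbols(data: bytes, width: int) -> List[int]:
    """Group raw bytes into fixed-width little-endian symbols (width=4 for int32)."""
    if len(data) % width:
        raise ValueError("input length must be a multiple of the symbol width")
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


def lzss_encode(symbols: List[int], window: int = 32, min_match: int = 2) -> List[Token]:
    """Greedy LZSS over symbols: emit (offset, length) for repeats, else a literal."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(symbols):
        best_len, best_off = 0, 0
        # Brute-force search of the sliding window; one comparison per symbol,
        # so a 4-byte symbol costs one compare instead of four byte compares.
        for start in range(max(0, pos - window), pos):
            length = 0
            # A match may run past `pos` (overlapping copy), as in LZ77/LZSS.
            while (pos + length < len(symbols)
                   and symbols[start + length] == symbols[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        if best_len >= min_match:
            tokens.append((best_off, best_len))
            pos += best_len
        else:
            tokens.append(symbols[pos])
            pos += 1
    return tokens


def lzss_decode(tokens: List[Token]) -> List[int]:
    """Invert lzss_encode by replaying literals and back-references."""
    out: List[int] = []
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            for _ in range(length):
                out.append(out[-offset])  # symbol-by-symbol copy handles overlap
        else:
            out.append(tok)
    return out


def two_pass_exclusive_scan(values: List[int], block: int) -> List[int]:
    """Exclusive prefix sum in two passes (assumed scheme, for illustration)."""
    # Pass 1: each block scans its own values and records its total.
    local: List[int] = []
    totals: List[int] = []
    for b in range(0, len(values), block):
        running = 0
        for v in values[b:b + block]:
            local.append(running)
            running += v
        totals.append(running)
    # Pass 2: scan the block totals and add each block's base to its local result.
    bases: List[int] = []
    base = 0
    for t in totals:
        bases.append(base)
        base += t
    return [local[i] + bases[i // block] for i in range(len(values))]
```

In this sketch, encoding the symbol sequence 1, 2, 3 repeated three times yields three literals followed by a single match with offset 3 and length 6. In a parallel compressor, a scan such as `two_pass_exclusive_scan` would typically be applied to per-chunk compressed sizes to obtain the output offset of each chunk; that use is also an assumption here rather than something the abstract states.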
