CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs

Jeongmin Park, Zaid Qureshi, Vikram Mailthody, Andrew Gacek, Shunfan Shao, Mohammad AlMasri, Isaac Gelado, Jinjun Xiong, Chris Newburn, I-hsin Chung, Michael Garland, Nikolay Sakharnykh, Wen-mei Hwu

Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound; they dedicate most of the GPU's threads to data movement and adopt complex software techniques to hide the memory latency of reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization, as most threads end up waiting for the few decoding threads, exposing compute and synchronization latencies. Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the resulting abundance of compute activity and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques (RLE v1, RLE v2, and Deflate) and a wide range of large datasets from different domains. We show that CODAG provides 13.46x, 5.69x, and 1.18x speedups for RLE v1, RLE v2, and Deflate, respectively, compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
