Machine-generated data is growing rapidly and poses challenges for data-intensive systems, especially as the growth of data outpaces the growth of storage space. To cope with the storage issue, compression plays a critical role in storage engines, particularly for data-intensive applications, where high compression ratios and efficient random access are essential. However, existing compression techniques tend to focus on general-purpose and data block-based approaches and overlook the inherent structure of machine-generated data, which results in low compression ratios or limited lookup efficiency. To address these limitations, we introduce the Pattern-Based Compression (PBC) algorithm, which specifically targets patterns in machine-generated data to achieve Pareto-optimality in most cases. Unlike traditional data block-based methods, PBC compresses data on a per-record basis, facilitating rapid random access. Our experimental evaluation demonstrates that PBC, on average, achieves a compression ratio twice as high as state-of-the-art techniques while maintaining competitive compression and decompression speeds. We also integrate PBC into a production database system and improve both compression ratio and throughput.
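To make the idea of per-record, pattern-based compression concrete, the following is a minimal illustrative sketch and not the PBC algorithm itself. It assumes a small, hand-written pattern dictionary (`PATTERNS`) in which `*` marks a variable field; the pattern strings and the function names `compress` and `decompress` are hypothetical and chosen only for this example. Each record is stored as a pattern identifier plus its variable fields, so any single record can be decoded without touching its neighbours.

```python
import re

# Hypothetical pattern dictionary (an assumption for illustration only);
# "*" marks a variable field inside an otherwise fixed template.
PATTERNS = [
    "user=* action=login ip=*",
    "GET /api/v1/items/* status=*",
]


def _compile(pattern):
    # Escape the literal parts and turn each "*" into a capture group.
    parts = [re.escape(p) for p in pattern.split("*")]
    return re.compile("(.*?)".join(parts), re.DOTALL)


_COMPILED = [_compile(p) for p in PATTERNS]


def compress(record):
    """Encode one record as (pattern_id, fields); -1 means no pattern matched."""
    for pid, rx in enumerate(_COMPILED):
        m = rx.fullmatch(record)
        if m:
            return pid, list(m.groups())
    # Fallback: keep the raw record when no pattern applies.
    return -1, [record]


def decompress(encoded):
    """Rebuild one record from its own encoding, independent of other records."""
    pid, fields = encoded
    if pid == -1:
        return fields[0]
    literals = PATTERNS[pid].split("*")
    out = [literals[0]]
    for field, literal in zip(fields, literals[1:]):
        out.append(field)
        out.append(literal)
    return "".join(out)


records = [
    "user=alice action=login ip=10.0.0.7",
    "GET /api/v1/items/42 status=200",
    "free-form line that matches no pattern",
]
encoded = [compress(r) for r in records]

# Random access: decode only the second record.
second = decompress(encoded[1])
```

Because each encoded entry is self-contained, a lookup touches a single record, whereas a block-based scheme would have to decompress the whole block containing it. A real system would additionally discover the patterns from the data and encode the residual fields compactly; both steps are omitted here.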