Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Although dictionary-based encodings that approach Shannon's entropy have been studied comprehensively, few prior works have systematically exploited the serial correlation in a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to automatically remove the serial redundancy in a value sequence, achieving an outstanding compression ratio and decompression performance simultaneously. LeCo presents a general approach to this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) special cases of our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement in both compression ratio and random access speed over existing solutions. When integrating LeCo into widely used applications, we observe up to a 5.2x speedup on a data analytics query in the Arrow columnar execution engine and a 16% increase in RocksDB's throughput.
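To make the idea of removing serial redundancy concrete, the following is a minimal sketch, not LeCo's actual implementation. It assumes a single partition, a least-squares linear model over value positions, and residuals stored as non-negative offsets of a fixed bit width. The names `LinearPartition`, `compress`, and `access` and the `fit_slope` flag are hypothetical and chosen only for illustration. Setting `fit_slope=False` degenerates the model to a constant, which corresponds to a Frame-of-Reference style encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinearPartition:
    """One compressed partition: a line plus small non-negative offsets."""
    slope: float
    intercept: float
    bias: int            # minimum residual, subtracted so every offset is >= 0
    bit_width: int       # bits needed to store each offset
    offsets: List[int]   # plain list here; a real encoder would bit-pack these


def _predict(slope: float, intercept: float, i: int) -> int:
    # Integer prediction for position i; used identically by encoder and decoder.
    return int(round(slope * i + intercept))


def compress(values: List[int], fit_slope: bool = True) -> LinearPartition:
    """Fit value ~ slope * position + intercept and keep only the residuals.

    With fit_slope=False the model is a constant (an assumption made here to
    illustrate how a FOR-like encoding falls out as a special case).
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot compress an empty sequence")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    slope = 0.0
    if fit_slope and n > 1:
        # Ordinary least squares over positions 0..n-1.
        var_x = sum((i - mean_x) ** 2 for i in range(n))
        cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
        slope = cov / var_x
    intercept = mean_y - slope * mean_x
    residuals = [v - _predict(slope, intercept, i) for i, v in enumerate(values)]
    bias = min(residuals)
    offsets = [r - bias for r in residuals]
    bit_width = max(offsets).bit_length()
    return LinearPartition(slope, intercept, bias, bit_width, offsets)


def access(p: LinearPartition, i: int) -> int:
    """Random access: one model evaluation plus one stored offset."""
    return _predict(p.slope, p.intercept, i) + p.bias + p.offsets[i]
```

In this sketch, a perfectly linear sequence such as 100, 103, 106, 109, 112 leaves residuals of width zero under the linear model, whereas the constant model needs 4 bits per value for the same input. Decoding position `i` never touches its neighbors, which is the property that makes random access cheap in this style of encoding.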