Lightweight data compression is a key technique that allows column stores to deliver superior performance on analytical queries. Although dictionary-based encodings that approach Shannon's entropy have been studied comprehensively, few prior works have systematically exploited the serial correlation within a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to automatically remove the serial redundancy in a value sequence, achieving both a high compression ratio and fast decompression. LeCo offers a general approach to this end: existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) become special cases of our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement in both compression ratio and random-access speed over existing solutions. When integrating LeCo into widely used applications, we observe up to a 3.9x speedup in filter-scanning a Parquet file and a 16% increase in RocksDB's throughput.
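To make the idea of removing serial redundancy with a model concrete, the following is a minimal sketch and not the LeCo implementation. It assumes a single non-empty partition of integers, a least-squares linear model over positions, and residuals stored at one fixed bit width; the names `LinearBlock`, `fit_line`, `compress`, and `access` are hypothetical. Under these assumptions, passing `constant_model=True` forces a slope-zero model anchored at the minimum value, which behaves like Frame-of-Reference.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinearBlock:
    """One compressed partition: a line plus small non-negative residuals."""
    slope: float
    intercept: float
    bias: int             # smallest raw residual, subtracted before storage
    bit_width: int        # bits needed for the largest stored residual
    residuals: List[int]  # would be bit-packed in a real encoder


def fit_line(values: List[int]) -> Tuple[float, float]:
    """Least-squares fit of value against position (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0, float(values[0])
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    var_x = sum((i - mean_x) ** 2 for i in range(n))
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def predict(slope: float, intercept: float, i: int) -> int:
    """Integer prediction; encoder and decoder must use the same rounding."""
    return math.floor(slope * i + intercept)


def compress(values: List[int], constant_model: bool = False) -> LinearBlock:
    if constant_model:
        # Slope-zero model anchored at the minimum: Frame-of-Reference.
        slope, intercept = 0.0, float(min(values))
    else:
        slope, intercept = fit_line(values)
    raw = [v - predict(slope, intercept, i) for i, v in enumerate(values)]
    bias = min(raw)
    stored = [r - bias for r in raw]
    return LinearBlock(slope, intercept, bias, max(stored).bit_length(), stored)


def access(block: LinearBlock, i: int) -> int:
    """Random access: one model evaluation plus one residual lookup."""
    return predict(block.slope, block.intercept, i) + block.bias + block.residuals[i]


# A column with a strong linear trend and a little noise.
column = [1000 + 7 * i + (i % 3) for i in range(64)]
learned = compress(column)
frame = compress(column, constant_model=True)
print(learned.bit_width, frame.bit_width)
```

On a trending column like this one, the learned line leaves residuals that need fewer bits than the Frame-of-Reference offsets, while `access` still reconstructs any single value without decoding its neighbors.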