Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Although dictionary-based encodings that approach Shannon's entropy have been studied comprehensively, few prior works have systematically exploited the serial correlation in a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to automatically remove the serial redundancy in a value sequence, achieving an outstanding compression ratio and decompression performance simultaneously. LeCo presents a general approach to this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) special cases under our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement on both compression ratio and random access speed over the existing solutions. When integrating LeCo into widely used applications, we observe up to a 3.9x speedup in filter-scanning a Parquet file and a 16% increase in RocksDB's throughput.
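To make the idea of removing serial redundancy concrete, the following is a minimal sketch and not the LeCo implementation. It assumes a single partition, a linear model over the position index, and residuals kept in a plain Python list rather than bit-packed. The names `fit_line`, `encode`, and `decode_at` are hypothetical and chosen only for illustration. Passing a constant zero model reduces the scheme to a FOR-style encoding, where every value is stored as an offset from the minimum.

```python
import math


def fit_line(values):
    """Least-squares fit of values[i] ~ slope * i + intercept (hypothetical helper)."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    if n == 1:
        return (0.0, float(values[0]))
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return (slope, mean_y - slope * mean_x)


def predict(model, i):
    """Integer prediction of the model at position i."""
    slope, intercept = model
    return math.floor(slope * i + intercept)


def encode(values, model):
    """Store the model plus non-negative residuals (illustrative layout, not LeCo's)."""
    residuals = [v - predict(model, i) for i, v in enumerate(values)]
    bias = min(residuals)                  # shift so every stored delta is >= 0
    deltas = [r - bias for r in residuals]
    width = max(deltas).bit_length()       # bits a bit-packed layout would need per value
    return {"model": model, "bias": bias, "width": width, "deltas": deltas}


def decode_at(block, i):
    """Random access: recompute the prediction, then add the stored correction."""
    return predict(block["model"], i) + block["bias"] + block["deltas"][i]


values = [100, 103, 105, 109, 112, 114, 118, 121]
learned = encode(values, fit_line(values))   # linear model fitted to the data
for_like = encode(values, (0.0, 0.0))        # constant model: FOR-style offsets from the minimum
```

In this sketch decoding is lossless regardless of how well the model fits, because the residuals absorb any prediction error; a better fit only shrinks `width`. On the near-linear sequence above, the fitted model needs fewer bits per residual than the FOR-style constant model, and a sequence of identical values needs a width of zero, which is the run-like case.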