Compressing integer keys is a fundamental operation across multiple communities, including database management (DB), information retrieval (IR), and high-performance computing (HPC). Recent advances in \emph{learned indexes} have inspired the development of \emph{learned compressors}, which use simple, compact machine learning (ML) models to compress large-scale sorted keys. The core idea behind learned compressors is to encode sorted keys \emph{losslessly} by approximating them with \emph{error-bounded} ML models (e.g., piecewise linear functions) and storing the approximation errors in a \emph{residual array}, which guarantees exact key reconstruction. Although learned compressors are still at an early stage of exploration, our benchmark results demonstrate that a SIMD-optimized learned compressor can significantly outperform state-of-the-art CPU-based compressors. Drawing on our preliminary experiments, this vision paper explores the potential of learned data compression to enhance critical areas of DBMSs and related domains. Furthermore, we outline the key technical challenges that existing systems must address when integrating this emerging methodology.
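To make the core idea concrete, the following is a minimal sketch of an error-bounded piecewise linear compressor with a residual array. It is an illustration under stated assumptions, not the implementation benchmarked in this paper: the function names (`fit_segments`, `predict`, `compress`, `decompress`), the parameter `eps`, and the greedy shrinking-cone segmentation are all choices made for this example, and the sketch omits SIMD optimization and bit-packing of the residuals.

```python
from typing import List, Tuple

# Hypothetical segment layout for this sketch: (start index, first key, slope).
Segment = Tuple[int, int, float]


def fit_segments(keys: List[int], eps: int) -> List[Segment]:
    """Greedily split sorted keys into linear segments with error at most eps.

    Each segment is anchored at its first key. A "cone" of feasible slopes
    [lo, hi] is narrowed as points are added; when it becomes empty, a new
    segment starts. This is one simple strategy, assumed for illustration.
    """
    segments: List[Segment] = []
    n = len(keys)
    start = 0
    while start < n:
        base = keys[start]
        lo, hi = float("-inf"), float("inf")
        end = start + 1
        while end < n:
            dx = end - start
            dy = keys[end] - base
            new_lo = max(lo, (dy - eps) / dx)
            new_hi = min(hi, (dy + eps) / dx)
            if new_lo > new_hi:  # no single slope fits all points any more
                break
            lo, hi = new_lo, new_hi
            end += 1
        # A one-point segment has an unconstrained cone; use slope 0.
        slope = 0.0 if end == start + 1 else (lo + hi) / 2
        segments.append((start, base, slope))
        start = end
    return segments


def predict(base: int, slope: float, offset: int) -> int:
    """Model prediction for the key at `offset` positions into a segment."""
    return base + round(slope * offset)


def compress(keys: List[int], eps: int) -> Tuple[List[Segment], List[int]]:
    """Return the segments plus one small residual per key."""
    segments = fit_segments(keys, eps)
    residuals: List[int] = []
    for idx, (start, base, slope) in enumerate(segments):
        end = segments[idx + 1][0] if idx + 1 < len(segments) else len(keys)
        for i in range(start, end):
            residuals.append(keys[i] - predict(base, slope, i - start))
    return segments, residuals


def decompress(segments: List[Segment], residuals: List[int]) -> List[int]:
    """Reconstruct the keys exactly as prediction + residual."""
    keys: List[int] = []
    for idx, (start, base, slope) in enumerate(segments):
        end = segments[idx + 1][0] if idx + 1 < len(segments) else len(residuals)
        for i in range(start, end):
            keys.append(predict(base, slope, i - start) + residuals[i])
    return keys
```

Because compression and decompression call the same `predict` function, reconstruction is exact even though the model uses floating-point slopes. The space saving comes from the residuals being bounded by `eps`, so each one could be stored in roughly $\lceil \log_2(2\,\mathit{eps}+1) \rceil$ bits instead of a full machine word; that packing step is left out of the sketch.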