Floating-point time series data are being generated in prohibitively large volumes and at an unprecedented rate. An efficient, compact, and lossless compression method for time series data is therefore of great importance in a wide range of scenarios. Most existing lossless floating-point compression methods are based on the XOR operation, but they do not fully exploit trailing zeros, which usually results in an unsatisfactory compression ratio. This paper proposes an Erasing-based Lossless Floating-point compression algorithm, named Elf. The main idea of Elf is to erase the last few bits of each floating-point value (i.e., set them to zero), so that the XORed values are expected to contain many trailing zeros. The erasing-based method faces three challenges. First, how can the bits to erase be determined quickly? Second, how can the original data be recovered losslessly from the erased data? Third, how can the erased data be encoded compactly? Through rigorous mathematical analysis, Elf directly determines the erased bits and restores the original values without losing any precision. To further improve the compression ratio, we propose a novel encoding strategy for XORed values with many trailing zeros. Furthermore, observing that the values in a time series usually have similar significand counts, we propose an upgraded version of Elf, named Elf+, which optimizes the significand count encoding strategy and thereby further improves the compression ratio and reduces the running time. Both Elf and Elf+ work in a streaming fashion. They take only O(N) time, where N is the length of the time series, and O(1) space, and they achieve a notable compression ratio with a theoretical guarantee. Extensive experiments on 22 datasets demonstrate the strong performance of Elf and Elf+ compared with 9 advanced competitors, for both double-precision and single-precision floating-point values.
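To make the erasing idea concrete, the following minimal Python sketch zeroes low-order mantissa bits of a double while the original value can still be recovered by rounding to a known number of decimal places. It is an illustration under stated assumptions, not the algorithm of the paper: the brute-force search in `erase`, the `decimals` parameter, the recovery via `round`, and all function names are hypothetical, whereas Elf determines the erased bits analytically and also encodes the information needed for recovery. The sketch assumes finite, positive inputs whose decimal precision is known in advance.

```python
import struct


def to_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of a double as an unsigned int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def from_bits(b: int) -> float:
    """Inverse of to_bits: reinterpret a 64-bit pattern as a double."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]


def trailing_zeros(b: int) -> int:
    """Count trailing zero bits of a 64-bit pattern (64 if the pattern is 0)."""
    if b == 0:
        return 64
    return (b & -b).bit_length() - 1


def erase(v: float, decimals: int) -> float:
    """Zero as many low mantissa bits as possible (hypothetical brute force).

    A candidate is accepted only if rounding it to `decimals` decimal
    places gives back exactly v, so the erasure stays lossless here.
    """
    bits = to_bits(v)
    best = bits
    for k in range(1, 53):  # a double has 52 explicit mantissa bits
        candidate = (bits >> k) << k  # clear the k lowest bits
        if round(from_bits(candidate), decimals) == v:
            best = candidate
        else:
            break  # clearing more bits only moves further away from v
    return from_bits(best)


if __name__ == "__main__":
    series = [3.17, 3.18, 3.21]  # illustrative values with 2 decimal places
    prev_raw = prev_erased = 0
    for v in series:
        raw, erased = to_bits(v), to_bits(erase(v, 2))
        print(
            v,
            "raw XOR trailing zeros:", trailing_zeros(raw ^ prev_raw),
            "erased XOR trailing zeros:", trailing_zeros(erased ^ prev_erased),
        )
        prev_raw, prev_erased = raw, erased
```

Because every erased value ends in a long run of zero bits, the XOR of two consecutive erased values does as well, which is the property an XOR-based encoder can then exploit.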