Error-controlled lossy compressors are widely used in scientific applications to reduce the unprecedented volume of scientific data while keeping data distortion within a user-specified threshold. Although they significantly relieve the pressure on data storage and transmission, they prolong data access time, because decompression is required to transform the binary compressed data into meaningful floating-point numbers. This overhead is noticeable for common analytical operations that extract or derive useful information from scientific data, because the operations themselves can cost much less time than the decompression that precedes them. In this work, we design an error-controlled lossy compression and analytical framework that reduces data access latency through multi-stage decompression and homomorphic analytical operation algorithms that work on intermediate decompressed data. Our contributions are threefold. (1) We abstract a generic compression pipeline that supports partial decompression to multiple intermediate data representations, and we implement four instances of it based on state-of-the-art high-throughput scientific data compressors. (2) We design homomorphic algorithms that operate directly on intermediate decompressed data for three types of analytical operations on scientific data. (3) We evaluate our approach on five real-world scientific datasets. The experiments demonstrate that our method achieves significant speedups when performing analytical operations on compressed scientific data across all three targeted operation types.
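To make the idea of operating on intermediate decompressed data concrete, the following is a minimal sketch and not the framework described above. It assumes a simple two-stage compressor: linear quantization with error bound `eb`, followed by a lossless stage for which `zlib` stands in. All function names (`compress`, `decode_codes`, `reconstruct`, `mean_full`, `mean_homomorphic`) are hypothetical. Under these assumptions, the mean of the reconstructed values equals `2 * eb` times the mean of the integer codes, so the mean can be computed after the first decompression stage without rebuilding the floating-point array.

```python
import struct
import zlib


def compress(values, eb):
    """Quantize each value to an integer code q with |x - 2*eb*q| <= eb,
    then pack the codes losslessly (zlib is a stand-in for the real coder)."""
    codes = [round(v / (2 * eb)) for v in values]
    raw = struct.pack(f"<{len(codes)}q", *codes)
    return zlib.compress(raw)


def decode_codes(blob):
    """Stage 1 (partial decompression): recover only the integer codes."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


def reconstruct(codes, eb):
    """Stage 2 (full decompression): map integer codes back to floats."""
    return [2 * eb * q for q in codes]


def mean_full(blob, eb):
    """Baseline: fully decompress, then average the floats."""
    values = reconstruct(decode_codes(blob), eb)
    return sum(values) / len(values)


def mean_homomorphic(blob, eb):
    """Average the integer codes and rescale once, skipping stage 2."""
    codes = decode_codes(blob)
    return 2 * eb * sum(codes) / len(codes)
```

In this sketch both paths return the same mean up to floating-point rounding, and the result stays within `eb` of the mean of the original data because every reconstructed value is within `eb` of its original. Which analytical operations admit such a shortcut, and at which intermediate representation, depends on the actual compressor pipeline.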