A General Framework for Progressive Data Compression and Retrieval

In scientific simulations, observations, and experiments, the cost of transferring data to and from disk and across networks has become a significant bottleneck that particularly impacts subsequent data analysis and visualization. To address this challenge, compression techniques have been widely adopted. However, traditional lossy compression approaches often require setting error tolerances conservatively to respect the numerical sensitivities of a wide variety of post hoc data analyses, some of which may not even be known a priori. Progressive data compression and retrieval has emerged as a solution, allowing for the adaptive handling of compressed data according to the needs of a given post-processing task. However, few analysis algorithms natively support progressive data processing, and adapting compression techniques, file formats, client/server frameworks, and APIs to support progressivity can be challenging. This work presents a general framework that supports progressive-precision data queries independently of the underlying data compressor or number representation. Our approach is based on a multiple-component representation that successively, with each new component, reduces the error between the original and compressed field, allowing each field in the progressive sequence to be expressed as a partial sum of components. We have implemented our approach on top of four popular scientific data compressors and have evaluated its behavior on several real-world data sets from the SDRBench collection. Numerical results indicate that our framework is effective in terms of accuracy compared to each of the standalone compressors it builds upon. In addition, (de)compression time is proportional to the number and granularity of components. Finally, our framework allows for fully lossless compression using lossy compressors when a sufficient number of components are employed.
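The multiple-component idea described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each component is obtained by lossily compressing the residual left by the previous components; the abstract does not spell out how components are generated. The function `lossy_roundtrip` is a stand-in (uniform scalar quantization with an absolute error bound), not one of the four compressors used in this work, and all names and tolerances here are illustrative only.

```python
import math

def lossy_roundtrip(values, tol):
    """Stand-in for compress + decompress with absolute error bound `tol`.

    Assumption: uniform quantization replaces a real error-bounded compressor.
    """
    step = 2.0 * tol
    return [round(v / step) * step for v in values]

def build_components(field, tolerances):
    """Compress the running residual once per tolerance (hypothetical scheme)."""
    components = []
    residual = list(field)
    for tol in tolerances:
        comp = lossy_roundtrip(residual, tol)
        components.append(comp)
        # What is still missing after adding this component.
        residual = [r - c for r, c in zip(residual, comp)]
    return components

def reconstruct(components, k):
    """Approximate the field as the partial sum of the first k components."""
    out = [0.0] * len(components[0])
    for comp in components[:k]:
        out = [o + c for o, c in zip(out, comp)]
    return out

def max_error(a, b):
    """Largest pointwise absolute difference between two fields."""
    return max(abs(x - y) for x, y in zip(a, b))

# Illustrative data and tolerances, not taken from the paper.
field = [math.sin(0.1 * i) for i in range(100)]
tolerances = [1e-1, 1e-2, 1e-3]
components = build_components(field, tolerances)
errors = [max_error(field, reconstruct(components, k))
          for k in range(1, len(components) + 1)]
```

In this sketch, a reader who needs only coarse accuracy would fetch the first component, while a more sensitive analysis would fetch further components and add them to the running sum. Up to floating-point rounding, the error of the k-th partial sum is bounded by the k-th tolerance.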
