Lossy compressors are increasingly adopted in scientific research to handle the large volumes of data produced by experiments and parallel numerical simulations, easing data storage and movement. Whereas entropy characterizes compressibility for lossless compression, no theoretical or data-based quantification of lossy compressibility exists for scientific data, so users rely on trial and error to assess lossy compression performance. As a strong data-driven effort toward quantifying the lossy compressibility of scientific datasets, we provide a statistical framework to predict the compression ratios of lossy compressors. Our method is a two-step framework in which (i) compressor-agnostic predictors are computed and (ii) statistical prediction models relying on these predictors are trained on observed compression ratios. The proposed predictors exploit spatial correlations and capture notions of entropy and lossiness via the quantized entropy. We study 8+ compressors on 6 scientific datasets and achieve a median percentage prediction error of less than 12%, which is substantially smaller than that of other methods, while achieving at least an 8.8x speedup when searching for a specific compression ratio and a 7.8x speedup when determining the best compressor out of a collection.
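The two-step framework can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration rather than the method of the paper: the function names are hypothetical, the data are synthetic one-dimensional AR(1) series, lag-1 correlation is a crude stand-in for the spatial-correlation predictors, uniform quantization followed by `zlib` stands in for a real error-bounded lossy compressor, and the prediction model is a plain least-squares fit on the logarithm of the compression ratio.

```python
import math
import random
import statistics
import struct
import zlib
from collections import Counter


def quantize(values, error_bound):
    """Uniform quantization with bin width 2*error_bound (max error <= error_bound)."""
    return [round(v / (2.0 * error_bound)) for v in values]


def quantized_entropy(values, error_bound):
    """Step (i): Shannon entropy, in bits per value, of the quantized data."""
    codes = quantize(values, error_bound)
    n = len(codes)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(codes).values())


def lag1_correlation(values):
    """Step (i): Pearson correlation between neighbours (assumed correlation proxy)."""
    a, b = values[:-1], values[1:]
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def toy_compression_ratio(values, error_bound):
    """Stand-in compressor (assumption): quantize, pack as int32, deflate with zlib."""
    codes = quantize(values, error_bound)
    packed = struct.pack(f"<{len(codes)}i", *codes)
    return 8 * len(values) / len(zlib.compress(packed))  # input counted as float64


def fit_least_squares(rows, targets):
    """Ordinary least squares via normal equations and Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predictors(values, error_bound):
    """Feature row: intercept plus the two compressor-agnostic predictors."""
    return [1.0, quantized_entropy(values, error_bound), lag1_correlation(values)]


def train_model(signals, error_bound):
    """Step (ii): regress observed log compression ratios on the predictors."""
    rows = [predictors(s, error_bound) for s in signals]
    targets = [math.log(toy_compression_ratio(s, error_bound)) for s in signals]
    return fit_least_squares(rows, targets)


def predict_ratio(coefficients, values, error_bound):
    """Predict a compression ratio without running the compressor."""
    feats = predictors(values, error_bound)
    return math.exp(sum(c * f for c, f in zip(coefficients, feats)))


def make_signal(rng, phi, n=2000):
    """Synthetic AR(1) series; larger phi gives smoother, more correlated data."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    eb = 0.05
    training = [make_signal(rng, phi) for phi in (0.0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95)]
    model = train_model(training, eb)
    unseen = make_signal(rng, 0.5)
    print("predicted:", predict_ratio(model, unseen, eb))
    print("observed: ", toy_compression_ratio(unseen, eb))
```

In this sketch the predictors depend only on the data and the error bound, so the same feature rows could in principle be reused to train one model per compressor. That separation is what makes the predictors compressor-agnostic.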