Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. Most state-of-the-art compressors follow a prediction-residual paradigm, in which compression effectiveness depends on the quality of the predictor: more accurate predictions yield smaller residuals that are easier to compress. This observation raises a question: can modern machine learning (ML) models serve as superior predictors for scientific data compression? Answering this question directly is challenging because developing compression-specific ML predictors requires substantial resources. Instead, we turn to the climate domain, where highly accurate pretrained weather forecasting foundation models already exist, making it an ideal testbed. We present a framework that integrates spatial and temporal deep learning models into a conventional error-bounded compression pipeline. The framework supports auto-regressive forecasting models and avoids error accumulation. Using ERA5 climate data as a representative large-scale scientific dataset, we evaluate three distinct ML predictors, namely a VAEformer-based codec (CRA5), a graph neural network forecaster (GraphCast), and a vision-transformer forecaster (Aurora), against the state-of-the-art compressor SZ3.1 under identical quantization and entropy-coding backends. Our evaluation over approximately 1.7 TB of data reveals a surprising result: although ML predictors generate more accurate predictions and can improve reconstruction quality by up to 91% while achieving up to 9.6x higher compression ratios for highly predictable variables, they do not improve the overall dataset-level compression ratio. We show that prediction accuracy alone is insufficient: the spatial structure of the resulting residuals plays a decisive role in entropy coding efficiency.
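To make the prediction-residual paradigm concrete, the following is a minimal sketch of error-bounded residual quantization, assuming a simple linear-scale quantizer with bin width twice the error bound. It is not the SZ3.1 implementation or the framework evaluated here; the function names (`quantize_residuals`, `reconstruct`, `empirical_entropy_bits`) and the synthetic data are hypothetical and chosen for illustration only. The empirical entropy of the quantization codes serves as a rough proxy for how well an order-0 entropy coder could compress them.

```python
import numpy as np


def quantize_residuals(data, prediction, error_bound):
    """Map residuals to integer codes using bins of width 2 * error_bound."""
    residual = data - prediction
    return np.rint(residual / (2.0 * error_bound)).astype(np.int64)


def reconstruct(codes, prediction, error_bound):
    """Invert the quantizer; the result is within error_bound of the data."""
    return prediction + codes * (2.0 * error_bound)


def empirical_entropy_bits(codes):
    """Shannon entropy (bits per value) of the code histogram."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


if __name__ == "__main__":
    # Synthetic stand-in for a scientific field (assumption, not ERA5).
    rng = np.random.default_rng(0)
    data = np.sin(np.linspace(0.0, 20.0, 10_000))
    error_bound = 0.01

    # Two hypothetical predictors that differ only in accuracy.
    predictors = {
        "accurate": data + rng.normal(0.0, 0.01, data.shape),
        "poor": data + rng.normal(0.0, 0.10, data.shape),
    }
    for name, prediction in predictors.items():
        codes = quantize_residuals(data, prediction, error_bound)
        recon = reconstruct(codes, prediction, error_bound)
        max_err = float(np.max(np.abs(recon - data)))
        bits = empirical_entropy_bits(codes)
        print(f"{name}: max error {max_err:.4f}, {bits:.2f} bits/value")
```

In this sketch both predictors respect the same error bound, and the more accurate one produces a narrower code distribution. Note that an order-0 entropy estimate ignores where the codes fall in space, so it cannot capture the effect of residual spatial structure that the abstract identifies as decisive.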