Scientific applications typically generate large volumes of floating-point data, which makes lossy compression one of the most effective methods for data reduction: it lowers storage requirements and improves performance in large-scale applications. However, variations in compression time can significantly reduce the overall performance gain by causing inaccurate scheduling, workload imbalance, and similar problems. Existing approaches rely on empirical methods to predict compression performance; these methods often lack interpretability and are limited in accuracy and generalizability. In this paper, we propose surrogate models for predicting the compression time of prediction-based lossy compressors, and we provide a detailed analysis, including uncertainty analysis, of the factors that influence time variability. Our evaluation shows that our solution accurately predicts compression time, with an average error of 5% across six scientific datasets. It also provides accurate 95% confidence intervals, which are essential for time-sensitive scheduling and applications.