Neural networks are used as generative surrogate models for scientific discovery, serving as trainable approximations of scientific simulations. These models let users replace time-consuming numerical simulations with learned alternatives that produce solutions quickly. However, high-fidelity generative surrogate models require massive training datasets, which can create storage and I/O challenges. Lossy compression is a promising way to reduce this burden, but compression errors may affect model quality in subtle ways that are difficult to quantify. In this work, we examine how lossy compression of training data affects the quality of generative surrogate models. We begin by characterizing the uncertainty inherent in training neural networks, showing that identical training configurations can produce different models. Exploiting this variability, we propose a method to estimate how much compression-induced error a surrogate model can tolerate without losing accuracy. An evaluation on two application simulations demonstrates that our approach significantly reduces memory and storage requirements and speeds up training while still producing high-quality surrogate models. Lossy compression reduces data storage by up to 23.7x and 39x with negligible impact on the quality of the surrogate model. The smaller training dataset also loads faster, reducing training time by up to 3x.
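One possible reading of the tolerance idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in this work: we assume a scalar model error is available for each training run, that the spread across repeated runs with an identical configuration is summarized as mean plus `k` standard deviations, and that candidate compression error bounds are tried from smallest to largest. The function names, the parameter `k`, and all numbers in the usage example are hypothetical placeholders.

```python
from statistics import mean, stdev


def tolerance_threshold(baseline_errors, k=2.0):
    """Upper edge of the training-noise band: mean + k * sample std.

    baseline_errors: model errors from repeated runs that share one
    configuration and use uncompressed data (assumed input).
    """
    return mean(baseline_errors) + k * stdev(baseline_errors)


def largest_tolerated_bound(baseline_errors, errors_by_bound, k=2.0):
    """Largest compression error bound whose model error stays in the band.

    errors_by_bound maps a compression error bound to the error of a model
    trained on data compressed with that bound. Bounds are scanned in
    ascending order and the scan stops at the first one that leaves the band.
    Returns None if even the smallest bound is outside the band.
    """
    limit = tolerance_threshold(baseline_errors, k)
    best = None
    for bound in sorted(errors_by_bound):
        if errors_by_bound[bound] > limit:
            break
        best = bound
    return best


if __name__ == "__main__":
    # Placeholder values for illustration only; not measurements.
    baseline = [0.100, 0.104, 0.098, 0.102, 0.096]
    compressed = {1e-4: 0.101, 1e-3: 0.103, 1e-2: 0.105, 1e-1: 0.140}
    print(largest_tolerated_bound(baseline, compressed))
```

The design choice here is to treat run-to-run variability as a noise floor: a compression setting counts as tolerated when the resulting model is indistinguishable, by this simple criterion, from another run on uncompressed data. A stricter or more statistical test could be substituted without changing the overall structure.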