Lossy Compression of Scientific Data: Applications Constrains and Requirements

Franck Cappello, Allison Baker, Ebru Bozdağ, Martin Burtscher, Kyle Chard, Sheng Di, Paul Christopher O'Grady, Peng Jiang, Shaomeng Li, Erik Lindahl, Peter Lindstrom, Magnus Lundborg, Kai Zhao, Xin Liang, Masaru Nagaso, Kento Sato, Amarjit Singh, Seung Woo Son, Dingwen Tao, Jiannan Tian, Robert Underwood, Kazutomo Yoshii, Danylo Lykov, Yuri Alexeev, Kyle Gerard Felker

From arXiv, 33 pages

Increasing data volumes from scientific simulations and instruments (supercomputers, accelerators, telescopes) often exceed network, storage, and analysis capabilities. The scientific community's response to this challenge is scientific data reduction. Reduction can take many forms, such as triggering, sampling, filtering, quantization, and dimensionality reduction. This report focuses on a specific technique: lossy compression. Lossy compression retains all data points, exploiting correlations in the data and a controlled reduction in accuracy. Quality constraints, especially for quantities of interest, are crucial for preserving scientific discoveries. User requirements also include compression ratio and speed. While many papers have been published on lossy compression techniques and reference datasets are shared by the community, there is a lack of detailed specifications of application needs that can guide lossy compression researchers and developers. This report fills this gap by reporting on the requirements and constraints of nine scientific applications covering a large spectrum of domains (climate, combustion, cosmology, fusion, light sources, molecular dynamics, quantum circuit simulation, seismology, and system logs). The report also details key lossy compression technologies (SZ, ZFP, MGARD, LC, SPERR, DCTZ, TEZip, LibPressio), discussing their history, principles, error control, hardware support, features, and impact. By presenting both application needs and compression technologies, the report aims to inspire new research to fill existing gaps.
