Scientific datasets present unique challenges for machine learning-driven compression methods, including more stringent accuracy requirements and the need to mitigate potentially invalidating artifacts. Drawing on results from compressed sensing and rate-distortion theory, we introduce effective data compression methods by developing autoencoders with high-dimensional latent spaces that are $L^1$-regularized to obtain sparse low-dimensional representations. We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts, yielding highly effective compression methods for scientific data. We demonstrate our methods on small-angle scattering (SAS) datasets, showing that they can achieve compression ratios of around two orders of magnitude, and in some cases better. Our compression methods show promise for addressing current bottlenecks in transmission, storage, and analysis in high-performance distributed computing environments. This is central to processing the large volume of SAS data being generated at shared experimental facilities around the world to support scientific investigations. Our approaches provide general ways of obtaining specialized compression methods for targeted scientific datasets.
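As a rough illustration of the idea of an $L^1$-regularized latent code, the sketch below shows one plausible form of the training objective (reconstruction error plus an $L^1$ penalty on the latent vector) and how a sparse latent vector translates into a compression ratio. It is not the authors' implementation: the function names, the penalty weight `lam`, the (index, value) storage model, and all sizes are assumptions chosen for illustration.

```python
# Illustrative sketch only: every name, size, and the storage model below
# is an assumption, not taken from the paper.

def l1_penalized_loss(x, x_hat, z, lam):
    """Mean squared reconstruction error plus lam times the L1 norm of z."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in z)
    return mse + lam * l1


def sparse_encode(z, tol=0.0):
    """Keep only the (index, value) pairs whose magnitude exceeds tol."""
    return [(i, v) for i, v in enumerate(z) if abs(v) > tol]


def compression_ratio(n_input, sparse_code):
    """Input values stored per number kept in the sparse code.

    Assumes each retained entry costs two numbers (its index and its value).
    """
    stored = 2 * len(sparse_code)
    return n_input / stored if stored else float("inf")


# Hypothetical sizes: a 10,000-value input and a 4,096-dimensional latent
# vector in which only 50 entries are nonzero after training.
latent = [0.0] * 4096
for k in range(50):
    latent[k * 80] = 1.0 + 0.01 * k

code = sparse_encode(latent)
ratio = compression_ratio(10_000, code)
print(len(code), ratio)
```

With these made-up numbers, 50 retained entries cost 100 stored numbers against 10,000 input values, so the ratio is 100, the same order as the "two orders of magnitude" quoted above. A real pipeline would also have to account for the quantization and entropy coding of the retained values.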