Improvements in computational and experimental capabilities are rapidly increasing the amount of scientific data that is routinely generated. In applications constrained by memory and computational intensity, excessively large datasets may hinder scientific discovery, making data reduction a critical component of data-driven methods. Datasets are growing in two directions: the number of data points and their dimensionality. Whereas dimension reduction typically aims to describe each data sample in a lower-dimensional space, the focus here is on reducing the number of data points. A strategy is proposed to select data points such that they uniformly span the phase space of the data. The proposed algorithm estimates the probability map of the data and uses it to construct an acceptance probability. An iterative method is used to accurately estimate the probability of rare data points when only a small subset of the dataset is used to construct the probability map. Instead of binning the phase space to estimate the probability map, its functional form is approximated with a normalizing flow, so the method naturally extends to high-dimensional datasets. The proposed framework is demonstrated as a viable pathway to enable data-efficient machine learning when abundant data is available. An implementation of the method is available in a companion repository (https://github.com/NREL/Phase-space-sampling).
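The selection step described above can be illustrated with a minimal sketch. It is not the code from the companion repository: the function name `uniform_phase_space_subsample` and its parameters are hypothetical, a histogram stands in for the normalizing flow as the density estimator (which only makes sense in low dimension), and the iterative refinement of the probability map is omitted. The acceptance probability is assumed to be inversely proportional to the estimated density, scaled so that about `n_target` points are kept.

```python
import numpy as np


def uniform_phase_space_subsample(data, n_target, bins=20, seed=0):
    """Keep roughly n_target rows of `data`, spread evenly over phase space.

    Assumption: a histogram replaces the normalizing flow used in the paper.
    """
    rng = np.random.default_rng(seed)

    # Step 1: estimate the probability map (here, bin counts).
    counts, edges = np.histogramdd(data, bins=bins)

    # Step 2: look up the bin of every sample along every dimension.
    idx = tuple(
        np.clip(np.digitize(data[:, d], edges[d][1:-1]), 0, counts.shape[d] - 1)
        for d in range(data.shape[1])
    )
    density = np.maximum(counts[idx], 1.0)  # guard against empty-bin lookups

    # Step 3: acceptance probability inversely proportional to the density,
    # scaled toward n_target accepted points and capped at 1.
    weights = 1.0 / density
    accept = np.minimum(1.0, n_target * weights / weights.sum())

    # Step 4: accept or reject each sample independently.
    keep = rng.random(len(data)) < accept
    return data[keep]


# Usage: a Gaussian cloud is dense near the origin and sparse in the tails.
rng = np.random.default_rng(1)
data = rng.normal(size=(20000, 2))
subset = uniform_phase_space_subsample(data, n_target=1000)
```

Because points in dense regions are accepted with low probability and points in sparse regions with high probability, the retained subset is spread more evenly than the original cloud. Rare points whose bins hold fewer samples than the per-bin quota are all kept, so the subset can be somewhat smaller than `n_target`.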