Training set sampling methods are used to improve model performance and lower data costs in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy system (the Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Our numerical results indicate superior data efficiency and model robustness when using GGFPS compared to FPS and uniform random sampling (URS), as well as established supervised FPS-style selectors, PCov-FPS and PCov-CUR. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to twofold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinski-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
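To make the comparison concrete, the following is a minimal sketch of plain FPS and a gradient-weighted variant on the 2-dimensional Styblinski-Tang function. The abstract does not specify how GGFPS combines distances with force norms, so the multiplicative weighting, the exponent `beta`, and the offset `eps` below are hypothetical stand-ins rather than the authors' selection rule. All function names are illustrative.

```python
import numpy as np

def styblinski_tang(x):
    """Styblinski-Tang function: f(x) = 0.5 * sum_i (x_i^4 - 16 x_i^2 + 5 x_i)."""
    x = np.asarray(x, dtype=float)
    return 0.5 * np.sum(x**4 - 16.0 * x**2 + 5.0 * x, axis=-1)

def styblinski_tang_grad(x):
    """Analytic gradient, component-wise: 2 x_i^3 - 16 x_i + 2.5."""
    x = np.asarray(x, dtype=float)
    return 2.0 * x**3 - 16.0 * x + 2.5

def furthest_point_sampling(points, n_select, weights=None, start=0):
    """Greedy FPS over a finite pool of points.

    With weights=None this is plain FPS: each step picks the point whose
    distance to the nearest already-selected point is largest.
    With weights given, that distance is multiplied by a per-point weight
    before taking the argmax (ASSUMPTION: an illustrative weighting, not
    necessarily the rule used by GGFPS).
    """
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    selected = [start]
    # Distance from every pool point to its nearest selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_select - 1):
        score = min_dist * weights
        score[selected] = -np.inf  # never pick the same point twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        new_dist = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.array(selected)

def gradient_weights(points, beta=1.0, eps=1e-8):
    """Hypothetical weights (|grad f| + eps)**beta; beta and eps are assumptions."""
    grad_norm = np.linalg.norm(styblinski_tang_grad(points), axis=1)
    return (grad_norm + eps) ** beta

# Candidate pool drawn uniformly from the usual domain [-5, 5]^2.
rng = np.random.default_rng(0)
pool = rng.uniform(-5.0, 5.0, size=(2000, 2))

fps_idx = furthest_point_sampling(pool, 50)
# A negative beta up-weights low-gradient regions; the value is arbitrary here.
gg_idx = furthest_point_sampling(pool, 50, weights=gradient_weights(pool, beta=-0.5))
```

In this sketch, the sign of `beta` controls whether the weighting pulls samples toward low-gradient (near-stationary) or high-gradient regions. This is one way to explore the imbalance the abstract attributes to plain FPS, but it should not be read as the published algorithm.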