Data-driven machine learning (ML) models of atomistic interactions are often based on flexible, non-physical functions that map nuanced aspects of atomic arrangements to predictions of energies and forces. As a result, these potentials are only as good as their training data (usually the results of so-called ab initio simulations), and we need to make sure that the data carry enough information for a model to become sufficiently accurate, reliable and transferable. The main challenge stems from the fact that descriptors of chemical environments are often sparse, high-dimensional objects without a well-defined continuous metric. Therefore, it is rather unlikely that any ad hoc method of choosing training examples will be unbiased, and it is easy to fall into the trap of confirmation bias, where the same narrow and biased sampling is used to generate both the training and test sets. We will demonstrate that classical concepts of statistical planning of experiments and optimal design can help to mitigate such problems at a relatively low computational cost. The key feature of the methods we will investigate is that they allow us to assess the informativeness of data (how much we can improve the model by adding or swapping a training example) and to verify whether training is feasible with the current set before obtaining any reference energies and forces, a so-called off-line approach. In other words, we focus on an approach that is easy to implement and does not require sophisticated frameworks with automated access to high-performance computing (HPC) resources.
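To make the off-line idea concrete, the following is a minimal sketch of one classical optimal-design criterion (D-optimality), under the assumption that the model is linear in its parameters so that each candidate structure contributes one row of a descriptor matrix X. The function names, the ridge term and the choice of criterion are illustrative assumptions, not necessarily the procedure used in this work. Note that nothing here needs reference energies or forces: only the descriptors enter.

```python
import numpy as np


def is_feasible(X):
    """A linear fit is identifiable only if X has full column rank."""
    return np.linalg.matrix_rank(X) == X.shape[1]


def log_det_information(X, ridge=0.0):
    """Log-determinant of the information matrix X^T X (+ ridge * I).

    Larger values mean the set constrains the parameters more tightly
    (the D-optimality criterion). Returns -inf for a singular matrix.
    """
    M = X.T @ X + ridge * np.eye(X.shape[1])
    sign, logdet = np.linalg.slogdet(M)
    return logdet if sign > 0 else -np.inf


def gain_from_adding(X, x_new, ridge=1e-8):
    """Increase in log det(X^T X) from appending the row x_new.

    Uses the matrix determinant lemma:
    det(M + x x^T) = det(M) * (1 + x^T M^{-1} x).
    The small ridge (an assumption of this sketch) keeps M invertible
    when the current set is rank deficient.
    """
    M = X.T @ X + ridge * np.eye(X.shape[1])
    leverage = x_new @ np.linalg.solve(M, x_new)
    return np.log1p(leverage)


def rank_candidates(X, candidates, ridge=1e-8):
    """Indices of candidate rows, most informative first."""
    gains = [gain_from_adding(X, c, ridge) for c in candidates]
    return np.argsort(gains)[::-1]
```

A candidate that points along a direction the current set already covers yields a small gain, while one that probes an unsampled direction yields a large gain, so the ranking can guide which reference calculations are worth running.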