Efficiently creating a concise but comprehensive data set for training machine-learned interatomic potentials (MLIPs) is an under-explored problem. Active learning (AL), which uses either biased or unbiased molecular dynamics (MD) simulations to generate candidate pools, aims to address this problem. Existing biased and unbiased MD simulations, however, are prone to missing either rare events or extrapolative regions, i.e., areas of the configurational space where predictions are unreliable. Exploring both simultaneously is necessary for developing uniformly accurate MLIPs. In this work, we demonstrate that MD simulations biased by the MLIP's energy uncertainty effectively capture both extrapolative regions and rare events without requiring \textit{a priori} knowledge of the system's transition temperatures and pressures. Exploiting automatic differentiation, we enhance bias-forces-driven MD simulations by introducing the concept of bias stress. We also employ calibrated ensemble-free uncertainties derived from sketched gradient features, which yield MLIPs with similar or better accuracy than ensemble-based uncertainty methods at a lower computational cost. We use the proposed uncertainty-driven AL approach to develop MLIPs for two benchmark systems: alanine dipeptide and MIL-53(Al). Compared to MLIPs trained with conventional MD simulations, MLIPs trained with the proposed data-generation method more accurately represent the relevant configurational space of both systems.
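As a rough illustration of how bias forces and a bias stress can follow from a single uncertainty-biased energy, consider the sketch below. The notation is our own assumption and not necessarily the definition used in the paper: $E$ is the MLIP energy, $u$ is a differentiable energy-uncertainty estimate, $\tau > 0$ is a hypothetical biasing strength, $V$ is the cell volume, and $\boldsymbol{\epsilon}$ is a symmetric strain applied to both the atomic positions and the cell. Subtracting $\tau u$ from the energy lowers the biased energy where the uncertainty is high, so the dynamics are pushed toward uncertain configurations and cell shapes.

\begin{align*}
E_{\mathrm{b}} &= E - \tau\, u, \\
\mathbf{F}_i^{\mathrm{b}} &= -\nabla_{\mathbf{r}_i} E_{\mathrm{b}} = \mathbf{F}_i + \tau\, \nabla_{\mathbf{r}_i} u, \\
\boldsymbol{\sigma}^{\mathrm{b}} &= \frac{1}{V} \left. \frac{\partial E_{\mathrm{b}}}{\partial \boldsymbol{\epsilon}} \right|_{\boldsymbol{\epsilon} = 0} = \boldsymbol{\sigma} - \frac{\tau}{V} \left. \frac{\partial u}{\partial \boldsymbol{\epsilon}} \right|_{\boldsymbol{\epsilon} = 0}.
\end{align*}

Under these assumptions, both $\nabla_{\mathbf{r}_i} u$ and $\partial u / \partial \boldsymbol{\epsilon}$ can be obtained by automatic differentiation of $u$, in the same way that forces and stress are obtained from $E$.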