Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems

Efficient discovery of new materials requires reducing the number of costly first-principles calculations needed to train predictive machine learning models. We develop and validate an active learning framework that iteratively selects informative training structures for machine-learned interatomic potentials (MLIPs) from two large, heterogeneous materials databases, the Materials Project and OQMD. The framework combines compositional and property-based descriptors with a neural network ensemble, which provides real-time uncertainty quantification via Query-by-Committee. We systematically compare four selection strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach that balances both objectives. Experiments on four representative material systems (elemental carbon, silicon, iron, and a titanium-oxide compound), with five random seeds per configuration, show that diversity sampling consistently achieves competitive or superior performance, with the strongest advantage on complex systems such as titanium-oxide (10.9% improvement, p = 0.008). Overall, informed data selection reaches the target accuracy with 5-13% fewer labeled samples than random sampling. The entire pipeline runs on Google Colab in under 4 hours per system using less than 8 GB of RAM, which lowers the barrier to MLIP development for researchers with limited computational resources. Our open-source code and detailed experimental configurations are available on GitHub. This multi-system evaluation establishes practical guidelines for data-efficient MLIP training and points to promising future directions, including integration with symmetry-aware neural network architectures.
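To make the selection strategies named above concrete, the following is a minimal sketch and not the authors' implementation. It assumes NumPy only, a descriptor matrix `features` with one row per candidate structure, and an array `predictions` of shape (n_models, n_candidates) holding the ensemble outputs. The function names and the weighting parameter `alpha` are hypothetical. The k-means step of the diversity strategy is omitted, so only the farthest-point part is shown.

```python
import numpy as np


def committee_uncertainty(predictions):
    """Query-by-Committee score: standard deviation across ensemble members.

    predictions has shape (n_models, n_candidates).
    """
    return np.std(predictions, axis=0)


def farthest_point_selection(features, n_select, start=0):
    """Greedy farthest-point sampling over descriptor vectors (diversity)."""
    n_select = min(n_select, len(features))
    selected = [start]
    # Distance from each candidate to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    min_dist[start] = -np.inf  # never pick the same candidate twice
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -np.inf
    return selected


def hybrid_selection(features, predictions, n_select, alpha=0.5):
    """Greedy hybrid: alpha * uncertainty + (1 - alpha) * diversity.

    The linear weighting is an assumption made for illustration only.
    """
    n_select = min(n_select, len(features))
    unc = committee_uncertainty(predictions)
    unc = unc / (unc.max() + 1e-12)  # scale to [0, 1]
    first = int(np.argmax(unc))
    selected = [first]
    min_dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < n_select:
        div = min_dist / (min_dist.max() + 1e-12)  # scale to [0, 1]
        score = alpha * unc + (1.0 - alpha) * div
        score[selected] = -np.inf  # mask candidates already chosen
        nxt = int(np.argmax(score))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

In this sketch, setting `alpha=1.0` reduces the hybrid rule to pure uncertainty ranking and `alpha=0.0` reduces it to farthest-point sampling seeded at the most uncertain candidate. How the framework described in the abstract actually balances the two objectives is not specified there.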
