Efficient materials discovery requires reducing the number of costly first-principles calculations needed to train machine-learned interatomic potentials (MLIPs). We develop an active learning (AL) framework that iteratively selects informative structures from the Materials Project and the Open Quantum Materials Database (OQMD) using compositional and property-based descriptors with a neural network ensemble model. Query-by-Committee provides real-time uncertainty quantification. We compare four strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach. Experiments across four material systems (C, Si, Fe, and TiO2) with five random seeds show that diversity sampling achieves competitive or superior performance, with a 10.9% improvement on TiO2. Our approach reaches equivalent accuracy with 5-13% fewer labeled samples than the random baseline. The complete pipeline runs on Google Colab in under 4 hours per system using less than 8 GB of RAM, making MLIP development accessible to resource-limited researchers. Open-source code and configurations are available on GitHub. This multi-system evaluation provides practical guidelines for data-efficient MLIP training and highlights integration with symmetry-aware architectures as a promising future direction.
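To make the two non-random selection strategies concrete, the following is a minimal sketch and not the released implementation. It assumes that ensemble predictions are available as an array of shape (members, candidates) and that descriptors are plain feature vectors. The function names `committee_uncertainty`, `select_uncertain`, and `farthest_point_selection` are hypothetical. The k-means stage of the diversity strategy is omitted, so only the farthest-point step is shown.

```python
import numpy as np


def committee_uncertainty(predictions):
    """Query-by-Committee score: standard deviation across ensemble members.

    `predictions` has shape (n_members, n_candidates); this layout is an
    assumption made for illustration.
    """
    return np.asarray(predictions, dtype=float).std(axis=0)


def select_uncertain(predictions, batch_size):
    """Return indices of the candidates the committee disagrees on most."""
    scores = committee_uncertainty(predictions)
    order = np.argsort(-scores, kind="stable")  # highest disagreement first
    return order[:batch_size].tolist()


def farthest_point_selection(features, batch_size, start=0):
    """Greedy farthest-point sampling over descriptor vectors.

    Each step picks the candidate whose distance to its nearest
    already-chosen point is largest. The k-means stage is not included.
    """
    features = np.asarray(features, dtype=float)
    chosen = [start]
    # Distance from every candidate to its nearest chosen point so far.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    while len(chosen) < min(batch_size, len(features)):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


# Toy data: 3 ensemble members predicting a property for 3 candidates.
toy_predictions = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.5, 1.0],
    [1.0, 1.5, 5.0],
])
# Toy 2-D descriptors for 4 candidates.
toy_features = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [0.0, 5.0]])

uncertain_batch = select_uncertain(toy_predictions, batch_size=2)
diverse_batch = farthest_point_selection(toy_features, batch_size=3)
```

On the toy data, the committee agrees exactly on the first candidate, so it is never picked by the uncertainty strategy, while the farthest-point step skips the candidate that sits next to the starting point. A hybrid strategy could combine the two scores, but the abstract does not specify how, so none is sketched here.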