We propose MatSci ML, a novel benchmark for modeling MATerials SCIence using Machine Learning methods, focused on solid-state materials with periodic crystal structures. Applying machine learning to solid-state materials is a nascent field with substantial fragmentation, driven largely by the wide variety of datasets used to develop models. This fragmentation makes it difficult to compare the performance and generalizability of different methods, which hinders overall research progress in the field. Building on open-source datasets, including large-scale datasets such as OpenCatalyst, OQMD, NOMAD, the Carolina Materials Database, and Materials Project, the MatSci ML benchmark provides a diverse set of materials systems and property data for model training and evaluation, including simulated energies, atomic forces, and material bandgaps, as well as classification data for crystal symmetries via space groups. The diversity of properties in MatSci ML makes it possible to implement and evaluate multi-task learning algorithms for solid-state materials, while the diversity of datasets facilitates the development of new, more general algorithms and methods that work across multiple datasets. In the multi-dataset learning setting, MatSci ML enables researchers to combine observations from multiple datasets to jointly predict common properties, such as energy and forces. Using MatSci ML, we evaluate the performance of different graph neural networks and equivariant point cloud networks on several benchmark tasks spanning single-task, multi-task, and multi-dataset learning scenarios. Our open-source code is available at https://github.com/IntelLabs/matsciml.