EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations

Equivariant graph neural network force fields (EGraFFs) have shown great promise in modelling complex interactions in atomic systems by exploiting the graphs' inherent symmetries. Recent work has led to a surge of novel architectures that combine equivariance-based inductive biases with architectural innovations such as graph transformers and message passing to model atomic interactions. However, a thorough evaluation of deploying these EGraFFs for the downstream task of real-world atomistic simulations is lacking. To this end, we perform a systematic benchmarking of six EGraFF algorithms (NequIP, Allegro, BOTNet, MACE, Equiformer, TorchMDNet), with the aim of understanding their capabilities and limitations for realistic atomistic simulations. In addition to a thorough evaluation and analysis on eight existing datasets drawn from the benchmarking literature, we release two new benchmark datasets and propose four new metrics and three challenging tasks. The new datasets and tasks evaluate the performance of EGraFFs on out-of-distribution data, in terms of different crystal structures, temperatures, and new molecules. Interestingly, evaluation of the EGraFF models based on dynamic simulations reveals that a lower error on energy or force does not guarantee a stable or reliable simulation, or a faithful replication of the atomic structures. Moreover, we find that no model clearly outperforms the others on all datasets and tasks. Importantly, we show that the performance of all the models on out-of-distribution datasets is unreliable, pointing to the need for a foundation model for force fields that can be used in real-world simulations. In summary, this work establishes a rigorous framework for evaluating machine learning force fields in the context of atomic simulations and points to open research challenges within this domain.
