Machine-learning interatomic potentials (MLIPs) have advanced rapidly, with many top models relying on strong physics-based inductive biases. However, as models scale to larger systems such as biomolecules and electrolytes, they struggle to accurately capture long-range (LR) interactions, leading current approaches to fall back on explicit physics-based terms or components. In this work, we propose AllScAIP, a straightforward, attention-based, energy-conserving MLIP that scales to O(100 million) training samples. It addresses the long-range challenge with a purely data-driven all-to-all node attention component. Extensive ablations reveal that inductive biases improve sample efficiency in low-data/small-model regimes, but that these benefits diminish or even reverse as data and model size grow, while all-to-all attention remains critical for capturing LR interactions. Our model achieves state-of-the-art energy/force accuracy on molecular systems, as well as on a number of physics-based evaluations (OMol25), while remaining competitive on materials (OMat24) and catalysts (OC20). Furthermore, it enables stable, long-timescale MD simulations that accurately recover experimental observables, including density and heat of vaporization.
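The abstract does not specify AllScAIP's internals, but the core idea of a data-driven all-to-all node attention layer can be illustrated with standard multi-head self-attention applied across all atoms in a structure, with no distance cutoff or neighbor mask, so every atom can attend to every other atom. The sketch below is a minimal, hypothetical illustration under that assumption; the class name, dimensions, and residual/normalization choices are ours, not the paper's.

```python
import torch
import torch.nn as nn

class AllToAllNodeAttention(nn.Module):
    """Hypothetical sketch of an all-to-all node attention layer:
    plain multi-head self-attention over per-atom features, with no
    distance cutoff, so long-range atom pairs interact directly."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_atoms, dim) per-atom node features.
        # No attention mask is passed, so attention is all-to-all.
        h, _ = self.attn(x, x, x)
        return self.norm(x + h)  # residual connection + layer norm

# Usage: 4 structures of 32 atoms with 128-dim node features.
feats = torch.randn(4, 32, 128)
out = AllToAllNodeAttention(128)(feats)
print(out.shape)  # torch.Size([4, 32, 128])
```

Because no neighbor list or cutoff radius restricts the attention pattern, the layer can in principle learn long-range couplings directly from data, which is the contrast the abstract draws with explicit physics-based long-range terms.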