Machine learning interatomic potentials play a fundamental role in computational chemistry and materials science, enabling applications from molecular dynamics simulations to drug design and materials discovery. While recent approaches can estimate interatomic forces with high precision, it remains unclear to what extent they can generalise to previously unseen molecules. Do they learn the compositional structure of chemistry, capturing how molecular fragments and their combinations determine properties, or do they primarily learn to interpolate patterns that are specific to the training examples? To address this question, we propose a benchmark consisting of four tasks that require some form of compositional generalisation. In each task, models are tested on molecules that were unseen during training, but the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles. Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples, even when using foundation models that have been pre-trained on millions of molecules.