Foundation models have been transformational in machine learning fields such as natural language processing and computer vision. Similar success in atomic property prediction has been limited due to the challenges of training effective models across multiple chemical domains. To address this, we introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy that simultaneously trains on multiple datasets from different chemical domains, treating each dataset as a unique pre-training task within a multi-task framework. Our combined training dataset consists of $\sim$120M systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and generalization by fine-tuning over a diverse set of downstream tasks and datasets, including QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP demonstrates an average improvement of 59% over training from scratch, and matches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the potential of pre-training strategies that utilize diverse data to advance property prediction across chemical domains, especially for low-data tasks. Please visit https://nima.sh/jmp for further information.
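To make the multi-task setup concrete, the sketch below shows one way joint pre-training over several chemical-domain datasets could be organized: a shared backbone with a separate output head per dataset, where every optimization step mixes losses from all tasks. This is only an illustrative sketch, not the released JMP code; the dataset names come from the abstract, while the feature dimensions, head structure, and synthetic batches are assumptions for demonstration.

```python
import torch
from torch import nn

# Pre-training datasets named in the abstract, each treated as its own task.
TASKS = ["OC20", "OC22", "ANI-1x", "Transition-1x"]
IN_DIM, HIDDEN = 32, 64  # placeholder dimensions, not the paper's settings

# Shared backbone plus one prediction head per task (multi-task framework).
backbone = nn.Sequential(nn.Linear(IN_DIM, HIDDEN), nn.SiLU())
heads = nn.ModuleDict({t: nn.Linear(HIDDEN, 1) for t in TASKS})

optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(heads.parameters()), lr=1e-4
)
loss_fn = nn.L1Loss()

def sample_batch(task: str, batch_size: int = 8):
    """Toy stand-in for a per-dataset batch of system features and labels."""
    return torch.randn(batch_size, IN_DIM), torch.randn(batch_size, 1)

for step in range(100):
    optimizer.zero_grad()
    total_loss = torch.zeros(())
    for task in TASKS:  # each step sees every chemical domain
        x, y = sample_batch(task)
        pred = heads[task](backbone(x))
        total_loss = total_loss + loss_fn(pred, y)
    total_loss.backward()
    optimizer.step()
```

In practice, dataset sizes and loss scales differ across domains, so a real implementation would also need per-task sampling and loss balancing; the uniform loop above is only meant to show the shared-backbone, per-task-head structure.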