Foundation models have transformed machine learning fields such as natural language processing and computer vision. Similar success in atomic property prediction has been limited by the difficulty of training effective models across multiple chemical domains. To address this, we introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy that trains simultaneously on multiple datasets from different chemical domains, treating each dataset as a unique pre-training task within a multi-task framework. Our combined training dataset consists of $\sim$120M systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and generalization by fine-tuning on a diverse set of downstream tasks and datasets, including QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP achieves an average improvement of 59% over training from scratch and matches or sets the state of the art on 34 out of 40 tasks. Our work highlights the potential of pre-training strategies that use diverse data to advance property prediction across chemical domains, especially for low-data tasks.