Accurate prediction of atomistic, thermodynamic, and kinetic properties from molecular structures underpins materials innovation. Existing computational and experimental approaches lack the scalability required to explore chemical space efficiently. Scientific foundation models trained on large unlabelled datasets offer a path towards navigating chemical space across application domains. Here, we develop MIST, a family of molecular foundation models with up to an order of magnitude more parameters and data than prior works. Trained with a novel tokenizer, Smirk, which comprehensively captures nuclear, electronic, and geometric information, MIST learns representations of a diverse range of molecules. Fine-tuned MIST models predict more than 400 structure-property relationships and match or exceed state-of-the-art performance across diverse benchmarks, from physiology to electrochemistry. We demonstrate the ability of these models to solve real-world problems across chemical space, from multiobjective electrolyte solvent screening to stereochemical reasoning for organometallics and mixture property prediction. The clearest test of a foundation model is its ability to solve problems that were neither explicit targets of training nor central to the intentions of its developers. We identify olfactory perception mapping as such a problem, and show that MIST accurately predicts scent profiles and learns a hierarchical representation of olfactory space consistent with hyperbolic geometry. We formulate hyperparameter-aware Bayesian neural scaling laws, which eliminate the need for hyperparameter sweeps at every scale and make training large compute-optimal models feasible on a limited compute budget. The methods and findings presented here represent a significant step towards accelerating materials discovery, design, and optimization using foundation models.