Planning and conducting chemical syntheses remains a major bottleneck in the discovery of functional small molecules and prevents generative AI from being fully leveraged for molecular inverse design. While early work has shown that ML-based retrosynthesis models can predict reasonable routes, their accuracy is low for less frequent yet important reactions. Because multi-step search algorithms can only use reactions suggested by the underlying model, the applicability of these tools is inherently constrained by the accuracy of retrosynthesis prediction. Inspired by how chemists use different strategies to ideate reactions, we propose Chimera: a framework for building highly accurate reaction models that combine predictions from diverse sources with complementary inductive biases using a learning-based ensembling strategy. We instantiate the framework with two newly developed models, each of which already achieves state-of-the-art performance in its category. Through experiments across several orders of magnitude in data scale and across time-splits, we show that Chimera outperforms all major models by a large margin, owing both to the strong individual performance of its constituents and to the scalability of our ensembling strategy. Moreover, we find that PhD-level organic chemists prefer predictions from Chimera over baselines in terms of quality. Finally, we transfer the largest-scale checkpoint to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our framework unlocks, we anticipate further acceleration in the development of even more accurate models.