Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Planning and conducting chemical syntheses remains a major bottleneck in the discovery of functional small molecules, and prevents fully leveraging generative AI for molecular inverse design. While early work has shown that ML-based retrosynthesis models can predict reasonable routes, their low accuracy for less frequent, yet important reactions has been pointed out. As multi-step search algorithms are limited to reactions suggested by the underlying model, the applicability of those tools is inherently constrained by the accuracy of retrosynthesis prediction. Inspired by how chemists use different strategies to ideate reactions, we propose Chimera: a framework for building highly accurate reaction models that combine predictions from diverse sources with complementary inductive biases using a learning-based ensembling strategy. We instantiate the framework with two newly developed models, which already by themselves achieve state of the art in their categories. Through experiments across several orders of magnitude in data scale and time-splits, we show Chimera outperforms all major models by a large margin, owing both to the good individual performance of its constituents, but also to the scalability of our ensembling strategy. Moreover, we find that PhD-level organic chemists prefer predictions from Chimera over baselines in terms of quality. Finally, we transfer the largest-scale checkpoint to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our framework unlocks, we anticipate further acceleration in the development of even more accurate models.

翻译：规划和实施化学合成仍然是功能性小分子发现的主要瓶颈，并阻碍了生成式人工智能在分子逆向设计中的充分应用。尽管早期研究表明基于机器学习的逆合成模型能够预测合理的合成路线，但其对低频但重要反应的预测准确性较低的问题已被指出。由于多步搜索算法受限于底层模型所建议的反应，这些工具的适用性本质上受逆合成预测准确性的制约。受化学家运用不同策略构思反应的启发，我们提出Chimera：一个通过基于学习的集成策略，将具有互补归纳偏置的多样化预测源相结合，构建高精度反应模型的框架。我们通过两个新开发的模型实例化该框架，这些模型本身已在各自类别中达到最先进水平。通过在多个数量级的数据规模和时间划分上的实验，我们证明Chimera以显著优势超越所有主流模型，这既得益于其组成模型的优异个体性能，也归功于我们集成策略的可扩展性。此外，我们发现博士级有机化学家在预测质量方面更倾向于Chimera而非基线模型。最后，我们将最大规模的检查点迁移至某大型制药公司的内部数据集，展示了在分布偏移下的稳健泛化能力。随着该框架开启的新维度，我们预期将加速开发更精确的模型。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

《生成式模型: 变分自编码器与扩散模型》，75页ppt，Google DeepMind科学家Ruiqi Gao

专知会员服务

66+阅读 · 2023年6月10日

《用于无线通信和传感的智能反射面 (IRS)》（ICC 2022）新加坡国立大学2022最新53页slides

专知会员服务

26+阅读 · 2022年11月16日

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

分布外泛化(Out-Of-Distribution Generalization) 综述论文，22页pdf240篇文献

专知会员服务

64+阅读 · 2021年9月2日