As a fundamental task in computational chemistry, retrosynthesis prediction aims to identify a set of reactants that can synthesize a target molecule. Existing template-free approaches consider only the graph structure of the target molecule and often generalize poorly to rare reaction types and large molecules. Here, we propose T-Rex, a text-assisted retrosynthesis prediction approach that exploits pre-trained text language models, such as ChatGPT, to assist the generation of reactants. T-Rex first uses ChatGPT to generate a description of the target molecule and ranks candidate reaction centers based on both the description and the molecular graph. It then re-ranks these candidates by querying descriptions of each candidate's reactants and examining which group of reactants can best synthesize the target molecule. We observed that T-Rex substantially outperformed state-of-the-art graph-based approaches on two datasets, indicating the effectiveness of considering text information. We further found that T-Rex outperformed a variant that uses only the ChatGPT-based description without the re-ranking step, demonstrating that our framework improves on a straightforward integration of ChatGPT and graph information. Collectively, we show that text generated by pre-trained language models can substantially improve retrosynthesis prediction, opening up new avenues for exploiting ChatGPT to advance computational chemistry. The code is available at https://github.com/lauyikfung/T-Rex.
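The two-stage procedure described above (rank candidate reaction centers, then re-rank the shortlisted candidates using reactant descriptions) can be illustrated with a minimal sketch. This is not the T-Rex implementation: the names `Candidate`, `rank_then_rerank`, `describe`, `score_center`, `score_reactants`, and `top_k` are hypothetical, the language-model call is replaced by a stub that returns a fixed string, and the scoring functions are toy placeholders for learned models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    # Hypothetical container: one candidate reaction center and the
    # reactants it would produce (identifiers are illustrative only).
    reaction_center: str
    reactants: Sequence[str]
    stage1_score: float = 0.0
    stage2_score: float = 0.0


def rank_then_rerank(
    target: str,
    candidates: List[Candidate],
    describe: Callable[[str], str],
    score_center: Callable[[str, str, str], float],
    score_reactants: Callable[[str, str, Sequence[str], List[str]], float],
    top_k: int = 3,
) -> List[Candidate]:
    """Rank candidates with the target description, then re-rank the top_k
    using descriptions of each candidate's reactants."""
    # Stage 1: describe the target once and score every reaction center.
    target_text = describe(target)
    for cand in candidates:
        cand.stage1_score = score_center(target, target_text, cand.reaction_center)
    shortlist = sorted(candidates, key=lambda c: c.stage1_score, reverse=True)[:top_k]

    # Stage 2: describe each reactant and re-score only the shortlist.
    for cand in shortlist:
        reactant_texts = [describe(r) for r in cand.reactants]
        cand.stage2_score = score_reactants(
            target, target_text, cand.reactants, reactant_texts
        )
    return sorted(shortlist, key=lambda c: c.stage2_score, reverse=True)


# Toy usage with stand-in functions (assumptions, not real models).
def fake_describe(molecule: str) -> str:
    return f"description of {molecule}"  # stub for a language-model query


center_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
candidates = [
    Candidate("a", ["X"]),
    Candidate("b", ["Y", "Z"]),
    Candidate("c", ["W"]),
]
ranked = rank_then_rerank(
    target="T",
    candidates=candidates,
    describe=fake_describe,
    score_center=lambda mol, text, center: center_scores[center],
    score_reactants=lambda mol, text, reactants, texts: float(len(reactants)),
    top_k=2,
)
print([c.reaction_center for c in ranked])
```

In this sketch the second stage only sees the `top_k` survivors of the first, so the number of reactant descriptions requested grows with the shortlist size rather than with the full candidate set. Whether T-Rex limits its queries in the same way is an assumption here.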