Recent advances in large language models (LLMs) have demonstrated transformative potential across diverse fields. While LLMs have been applied to simplified molecular-input line-entry system (SMILES) representations in computer-aided synthesis planning (CASP), existing methodologies typically address a single task, such as precursor prediction. We introduce ChemBART, a SMILES-based LLM pre-trained on chemical reactions that provides a unified model for multiple downstream chemical tasks, realizing the paradigm of "one model, one pre-training, multiple tasks." By leveraging outputs from a mask-filling pre-training task on reaction expressions, ChemBART effectively solves a variety of chemical problems, including precursor/reagent generation, temperature and yield regression, molecular property classification, and optimization of the policy and value functions within a reinforcement learning framework coupled with Monte Carlo tree search for multi-step synthesis route design. Unlike single-molecule pre-trained LLMs constrained to specific applications, ChemBART addresses a broader range of chemical challenges and integrates them into comprehensive synthesis planning. Crucially, ChemBART-designed multi-step synthesis routes and reaction conditions directly informed wet-lab validation, which confirmed shorter pathways with approximately 30% yield improvement over literature benchmarks. Our work validates the power of reaction-focused pre-training and showcases the broad utility of ChemBART across the complete synthesis planning cycle.
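The mask-filling pre-training objective on reaction expressions can be illustrated with a minimal sketch. The regex tokenizer and the mask placement below are hypothetical simplifications for illustration, not the paper's actual vocabulary or masking policy; the idea is BART-style infilling, where a span of reaction-SMILES tokens is replaced by a single mask token and the model learns to reconstruct the original sequence.

```python
import re

MASK = "<mask>"

def tokenize_smiles(s: str) -> list[str]:
    # Illustrative SMILES tokenizer: bracketed atoms, two-letter halogens,
    # the reaction arrow ">>", digits, then any single character.
    pattern = r"(\[[^\]]+\]|Br|Cl|>>|>|\d|.)"
    return re.findall(pattern, s)

def mask_span(tokens: list[str], start: int, length: int) -> list[str]:
    # BART-style text infilling: replace a contiguous span of tokens with
    # one mask token; the pre-training target is the unmasked sequence.
    return tokens[:start] + [MASK] + tokens[start + length :]

# A reaction expression written as reactants>>product (esterification).
reaction = "CCO.CC(=O)O>>CC(=O)OCC"
tokens = tokenize_smiles(reaction)
masked = mask_span(tokens, 10, 4)  # mask a span straddling the ">>" arrow
```

Masking spans that cross the reaction arrow forces the model to reason jointly about reactants and products, which is one plausible way a single pre-training task could support both precursor generation and forward-looking tasks.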