Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

Lead optimization in drug discovery requires improving therapeutic properties while ensuring that every proposed molecular modification corresponds to a feasible synthetic route. Existing approaches either prioritize property scores without enforcing synthesizability or rely on expensive enumeration over large reaction networks, and direct application of Large Language Models (LLMs) frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment: it invokes specialized chemical analysis tools to identify reactive sites and proposes chemically grounded transformations from matched templates. A policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step reaction trajectories. A SMILES-based caching mechanism further reduces end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutic Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.563, a 10.4% relative improvement over the strongest synthesizable baseline, and attains the best sample efficiency on 10 of 14 tasks. Ablations confirm that tool-augmented reaction proposals and trajectory-level policy optimization contribute complementary gains. Because every step is grounded in a validated reaction template, each molecule MolReAct produces has improved properties and comes with an explicit synthetic pathway.
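
Two of the mechanisms named above, SMILES-keyed caching and the group-relative advantage used in GRPO, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say what MolReAct caches, so the sketch simply memoizes a scoring function by SMILES string. `CachedOracle`, `group_relative_advantages`, and `toy_oracle` are hypothetical names, the toy reward is invented for illustration, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


class CachedOracle:
    """Memoize scores by SMILES string (hypothetical helper, not from the paper)."""

    def __init__(self, oracle: Callable[[str], float]) -> None:
        self._oracle = oracle
        self._cache: Dict[str, float] = {}
        self.calls = 0  # number of real (uncached) oracle evaluations

    def __call__(self, smiles: str) -> float:
        # Assumption: callers pass canonical SMILES, so identical molecules share a key.
        if smiles not in self._cache:
            self._cache[smiles] = self._oracle(smiles)
            self.calls += 1
        return self._cache[smiles]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group (the GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_oracle(smiles: str) -> float:
    """Invented stand-in reward: fraction of characters that are 'O'."""
    return smiles.count("O") / max(len(smiles), 1)


oracle = CachedOracle(toy_oracle)
group = ["CCO", "CCOC(=O)C", "CCO", "c1ccccc1"]  # one group of candidate products
rewards = [oracle(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Here the repeated `"CCO"` is scored only once, and each candidate's advantage is measured against the other candidates in its own group, with no learned value baseline.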