Reinforcement Learning with LLM-Guided Action Spaces for Synthesizable Lead Optimization

Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enforcing synthesizability or rely on expensive enumeration over large reaction networks. Direct application of Large Language Models (LLMs) to molecular generation, meanwhile, frequently produces chemically invalid structures. We introduce MolReAct, a framework that formulates lead optimization as a Markov Decision Process over a synthesis-constrained action space defined by validated reaction templates. A tool-augmented LLM agent serves as a dynamic reaction environment: it invokes specialized chemical analysis tools to identify reactive sites and functional groups, then proposes a compact set of chemically grounded transformations from the matched templates. A dedicated policy model trained via Group Relative Policy Optimization (GRPO) selects among these constrained actions to maximize long-term oracle reward across multi-step trajectories, and a SMILES-based caching mechanism reduces end-to-end optimization time by approximately 43%. Across 13 property optimization tasks from the Therapeutics Data Commons and one structure-based docking task, MolReAct achieves an average Top-10 score of 0.571, the highest among all compared methods, ranking first or second on 13 of 14 tasks and attaining the best sample efficiency on 9 of 14 tasks. Because every optimization step is grounded in a validated reaction template, each molecule MolReAct produces is not only improved in the target property but also accompanied by an explicit, template-grounded synthetic pathway.
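
Two of the ingredients named above, the group-relative advantage used in GRPO and a cache keyed by SMILES strings, can be illustrated with a minimal sketch. This is not the MolReAct implementation: the function and class names (`group_relative_advantages`, `SmilesCache`), the use of the population standard deviation, and the choice to cache a generic per-molecule call are all assumptions made for illustration, and the abstract does not state what exactly is cached. A real system would likely canonicalize SMILES before using them as keys; here the raw string is used.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled actions.

    GRPO scores each sample relative to the other samples in its group
    instead of using a learned value baseline. The population standard
    deviation and the eps term are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SmilesCache:
    """Memoize an expensive per-molecule function, keyed by SMILES string.

    Hypothetical helper: the wrapped function could be an oracle score or
    an action-proposal call. Keys are raw strings, so two different SMILES
    spellings of the same molecule are treated as distinct entries.
    """

    def __init__(self, fn: Callable[[str], float]) -> None:
        self._fn = fn
        self._store: Dict[str, float] = {}
        self.misses = 0  # number of times the wrapped function actually ran

    def __call__(self, smiles: str) -> float:
        if smiles not in self._store:
            self.misses += 1
            self._store[smiles] = self._fn(smiles)
        return self._store[smiles]


def toy_oracle(smiles: str) -> float:
    # Placeholder score (string length); stands in for a real property oracle.
    return float(len(smiles))


if __name__ == "__main__":
    scorer = SmilesCache(toy_oracle)
    candidates = ["CCO", "CCN", "CCO", "c1ccccc1"]  # "CCO" repeats on purpose
    rewards = [scorer(s) for s in candidates]
    print("oracle calls:", scorer.misses)
    print("advantages:", group_relative_advantages(rewards))
```

In this sketch the repeated candidate is scored only once, which is the kind of saving a SMILES-keyed cache provides when multi-step trajectories revisit the same intermediate molecules.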
