Large Language Models (LLMs) have shown remarkable potential in scientific domains such as retrosynthesis, yet they often lack the fine-grained control needed to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule, a task where unconstrained generation can produce invalid or undesirable synthetic pathways. In this work, we introduce Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of LLMs in rigorous chemical logic. Our approach combines automated rule-based reasoning, built on a comprehensive database of 55+ SMARTS patterns and 40+ characterized protecting groups, with the generative intuition of neural models. The system operates via a hybrid architecture: an ``automatic mode'' in which symbolic logic deterministically identifies and guards reactive sites, and a ``human-in-the-loop mode'' that incorporates expert strategic constraints. Through ``active state tracking,'' we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.
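To make the ``active state tracking'' mechanism concrete, the following is a minimal sketch of a protection state keyed by canonical atom-map numbers, where a proposed reaction step is vetoed if it would touch any currently protected site. The class, method names, and the TBS example are illustrative assumptions for exposition, not the framework's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectionState:
    """Tracks which atom-map numbers currently carry a protecting group."""

    # canonical atom-map number -> protecting group installed at that site
    protected: dict[int, str] = field(default_factory=dict)

    def protect(self, atom_map: int, group: str) -> None:
        """Mark the atom at `atom_map` as guarded by `group`."""
        self.protected[atom_map] = group

    def deprotect(self, atom_map: int) -> None:
        """Release the protecting group at `atom_map`, if any."""
        self.protected.pop(atom_map, None)

    def is_allowed(self, touched_atom_maps: set[int]) -> bool:
        """Hard symbolic constraint: a proposed step may not modify
        any atom that currently carries a protecting group."""
        return not (touched_atom_maps & self.protected.keys())


state = ProtectionState()
state.protect(7, "TBS")          # e.g. silyl-protect the hydroxyl mapped to atom 7
print(state.is_allowed({3, 5}))  # True: step avoids all protected sites
print(state.is_allowed({5, 7}))  # False: step would touch protected atom 7
```

In this sketch, the check acts as a filter on neural proposals: any generated retrosynthetic step whose touched atom maps intersect the protection state is rejected before it can enter a pathway.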