Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy against a single ground truth, which fails to capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach aligns more closely with human synthesis planning practice. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that outperforms the LLM baselines under this benchmark.