Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and on Top-K accuracy against a single ground-truth answer, which fails to capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach aligns better with how human chemists plan syntheses. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.