In recent years, several reaction template-based and template-free approaches have been reported for single-step retrosynthesis prediction. Although many of these approaches perform well on traditional data-driven metrics, there is a disconnect between the model architectures used and the underlying chemistry principles governing retrosynthesis. Here, we propose a novel chemistry-aware retrosynthesis prediction framework that combines powerful data-driven models with chemistry knowledge. We report a tree-to-sequence transformer architecture that takes hierarchical SMILES grammar trees as input; these trees carry underlying chemistry information that is otherwise ignored by models based on purely SMILES-based representations. The proposed framework, grammar-based molecular attention tree transformer (G-MATT), achieves significant performance improvements over baseline retrosynthesis models. G-MATT achieves a top-1 accuracy of 51% (top-10 accuracy of 79.1%), an invalid rate of 1.5%, and a bioactive similarity rate of 74.8%. Further analyses based on attention maps demonstrate G-MATT's ability to preserve chemistry knowledge without resorting to extremely complex model architectures.