Effective molecular representation is a crucial factor in the performance of artificial intelligence models. This study introduces t-SMILES (tree-based SMILES), a flexible, fragment-based, multiscale molecular representation framework with three coding algorithms: TSSA (t-SMILES with Shared Atom), TSDY (t-SMILES with Dummy Atom) and TSID (t-SMILES with ID). The framework describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph. Systematic evaluations using the JTVAE, BRICS, MMPA, and Scaffold fragmentation schemes demonstrate the feasibility of constructing a multi-code molecular description system in which the various descriptions complement each other and enhance overall performance. t-SMILES also performs well on low-resource datasets, whether the model is original, data-augmented, or pre-trained and fine-tuned. It significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks. Furthermore, it surpasses state-of-the-art fragment-, graph- and SMILES-based approaches on ChEMBL, ZINC, and QM9.
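To make the central idea concrete, the following is a minimal sketch of serializing a fragment tree by breadth-first search over a full binary tree. It is an illustration under stated assumptions, not the t-SMILES implementation: the `Node` class, the `encode` and `decode` helpers, the placeholder token `&`, the separator `^`, and the hand-written fragments are all hypothetical, and no real fragmentation scheme (JTVAE, BRICS, MMPA, or Scaffold) is applied.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

EMPTY = "&"  # assumed placeholder token for a padded (empty) node
SEP = "^"    # assumed separator between fragment strings


@dataclass
class Node:
    frag: str                        # fragment SMILES, or EMPTY for padding
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def pad_to_full(node: Node) -> None:
    """Give every real fragment node exactly two children.

    Missing children become EMPTY leaves, so the result is a full binary
    tree in which every node has either zero or two children.
    """
    if node.frag == EMPTY:
        return
    if node.left is None:
        node.left = Node(EMPTY)
    if node.right is None:
        node.right = Node(EMPTY)
    pad_to_full(node.left)
    pad_to_full(node.right)


def encode(root: Node) -> str:
    """Serialize the padded tree in breadth-first order."""
    pad_to_full(root)
    tokens = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        tokens.append(node.frag)
        if node.left is not None:  # after padding, children come in pairs
            queue.append(node.left)
            queue.append(node.right)
    return SEP.join(tokens)


def decode(text: str) -> Node:
    """Rebuild the padded tree from its breadth-first string."""
    tokens = iter(text.split(SEP))
    root = Node(next(tokens))
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node.left = Node(next(tokens))
        node.right = Node(next(tokens))
        for child in (node.left, node.right):
            if child.frag != EMPTY:  # only real fragments have children
                queue.append(child)
    return root


# Hand-written placeholder fragments; a real pipeline would obtain them
# from a fragmentation scheme applied to the molecular graph.
root = Node("c1ccccc1", left=Node("CC", left=Node("N")), right=Node("O"))
encoded = encode(root)
print(encoded)  # c1ccccc1^CC^O^N^&^&^&^&^&
```

Because every real fragment node is followed by exactly two child slots, the string can be decoded without extra structural markers: `decode(encoded)` rebuilds the same padded tree. How fragments are reconnected at the atom level, which is where TSSA, TSDY, and TSID differ, is outside the scope of this sketch.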