Effective molecular representation is a crucial factor in the performance of artificial intelligence models. This study introduces t-SMILES (tree-based SMILES), a flexible, fragment-based, multiscale molecular representation framework with three code algorithms: TSSA, TSDY and TSID. It describes a molecule as a SMILES-type string obtained by performing a breadth-first search on a full binary tree formed from the fragmented molecular graph. Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold as fragmentation schemes show that it is feasible to construct a multi-code molecular description system in which the different descriptions complement each other and improve overall performance. In addition, on labeled low-resource datasets, t-SMILES avoids overfitting and achieves higher novelty scores while maintaining reasonable similarity, regardless of whether the model is original, data-augmented, or pre-trained and then fine-tuned. Furthermore, it significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks. It also surpasses state-of-the-art fragment-, graph-, and SMILES-based approaches on ChEMBL, ZINC, and QM9.
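The tree-to-string step described in the abstract (a breadth-first search over a full binary tree of fragments) can be illustrated with a minimal sketch. This is not the TSSA, TSDY or TSID algorithm: the fragments are written by hand rather than produced by JTVAE, BRICS, MMPA or Scaffold, and the names `Node`, `make_full` and `bfs_string`, as well as the padding symbol `&` and the separator `^`, are assumptions chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

EMPTY = "&"  # illustrative padding symbol for an empty node (assumption)
SEP = "^"    # illustrative separator between fragments (assumption)


@dataclass
class Node:
    """One tree node holding a fragment written as a SMILES-type string."""
    fragment: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def make_full(node: Node) -> Node:
    """Pad the tree in place so every node has either zero or two children."""
    if node.left is None and node.right is None:
        return node  # leaf: nothing to pad
    if node.left is None:
        node.left = Node(EMPTY)
    if node.right is None:
        node.right = Node(EMPTY)
    make_full(node.left)
    make_full(node.right)
    return node


def bfs_string(root: Node) -> str:
    """Visit the tree level by level and join the fragments into one string."""
    pieces = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        pieces.append(node.fragment)
        # In a full binary tree the children are either both present or both absent.
        if node.left is not None and node.right is not None:
            queue.append(node.left)
            queue.append(node.right)
    return SEP.join(pieces)


# Hand-written toy fragment tree (hypothetical, not the output of a real fragmenter).
toy = Node("c1ccccc1", left=Node("CC", left=Node("O")))
encoded = bfs_string(make_full(toy))
print(encoded)  # c1ccccc1^CC^&^O^&
```

Padding to a full binary tree is what makes a level-order string unambiguous in this sketch: because every node has zero or two children, the positions of the padding symbols record where branches are missing, so the tree shape can be recovered from the string alone.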