Language models demonstrate fundamental abilities in syntax, semantics, and reasoning, though their performance often depends significantly on the inputs they process. This study introduces TSIS (Simplified TSID) and its variants, TSISD (TSIS with Depth-First Search), TSISO (TSIS in Order), and TSISR (TSIS in Random), as integral components of the t-SMILES framework. These additions complete the framework's design and provide diverse approaches to molecular representation. Comprehensive analysis and experiments with deep generative models, including GPT, diffusion models, and reinforcement learning, show that the hierarchical structure of t-SMILES is more straightforward to parse than initially anticipated. Furthermore, t-SMILES consistently outperforms other linear representations such as SMILES, SELFIES, and SAFE, converging faster and generalizing better.