We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks corrupting pivotal chemical details (e.g., those underlying activity cliffs). CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. CamS first mines data-driven connection-aware motifs, then serializes them via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences. It achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
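The serialization pipeline described above can be sketched in a few lines of plain Python. This is a minimal toy illustration, not the paper's implementation: the motif graph is represented as a hypothetical adjacency dict, the motif names and the `<sep>` token are invented for the example, and the actual CamS motif vocabulary is mined from data rather than hand-specified.

```python
from collections import deque

def serialize_motifs(adjacency, scaffold_root):
    """Scaffold-rooted BFS over a motif graph -> causal token sequence.

    adjacency: dict mapping motif_id -> list of (neighbor_id, bond_label).
    Starting BFS at the scaffold motif yields a stable core-to-periphery
    order, as the abstract describes.
    """
    order, seen = [], {scaffold_root}
    queue = deque([scaffold_root])
    while queue:
        motif = queue.popleft()
        order.append(motif)
        # Sort neighbors for a deterministic traversal (a tie-breaking
        # assumption; the paper may use a different canonical order).
        for nbr, _bond in sorted(adjacency[motif]):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

def multiscale_sequence(scales):
    """Concatenate per-scale BFS orders fine -> coarse, so the model can
    condition coarse scaffolds on uncorrupted fine-scale evidence.

    scales: list of (adjacency, scaffold_root), ordered fine to coarse.
    """
    tokens = []
    for adjacency, root in scales:
        tokens += serialize_motifs(adjacency, root)
        tokens.append("<sep>")  # hypothetical scale separator token
    return tokens
```

For example, a fine-scale graph `{"ring_A": [("amide", 0), ("methyl", 0)], "amide": [("ring_A", 0)], "methyl": [("ring_A", 0)]}` rooted at `"ring_A"` serializes to `["ring_A", "amide", "methyl"]`, and `multiscale_sequence` appends the coarser scale's order after a separator.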