We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks corrupting the pivotal chemical details that drive phenomena such as activity cliffs. CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. It first mines connection-aware motifs from data, then serializes them via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences; it achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
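To make the serialization concrete, the following is a minimal Python sketch of the two steps named above: scaffold-rooted BFS over a motif graph and fine-to-coarse concatenation across scales. The motif labels, the `<link:i>` connection tokens, the `<sep>` scale separator, and the functions `serialize_scale` and `cams_sequence` are all illustrative placeholders; the paper's data-driven motif mining and actual token vocabulary are not reproduced here.

```python
from collections import deque

def serialize_scale(labels, adjacency, scaffold_id):
    """Scaffold-rooted BFS over a motif graph: emit motif tokens in a stable
    core-to-periphery order, with <link:i> connection tokens pointing back to
    already-emitted motifs so the decoder sees topology, not a bag of motifs."""
    index = {scaffold_id: 0}          # motif id -> BFS position
    queue = deque([scaffold_id])
    tokens = []
    while queue:
        m = queue.popleft()
        tokens.append(labels[m])
        for n in sorted(adjacency[m]):            # sorted for determinism
            if n not in index:                    # frontier motif: enqueue
                index[n] = len(index)
                queue.append(n)
            elif index[n] < index[m]:             # edge to earlier motif: emit once
                tokens.append(f"<link:{index[n]}>")
    return tokens

def cams_sequence(scales):
    """Concatenate per-scale serializations fine -> coarse, so coarse scaffold
    tokens can causally attend to uncorrupted fine-grained evidence."""
    seq = ["<bos>"]
    for labels, adjacency, scaffold_id in scales:  # ordered fine -> coarse
        seq += serialize_scale(labels, adjacency, scaffold_id) + ["<sep>"]
    return seq

# Toy usage with hypothetical motif labels: three fine-scale motifs that
# collapse into a single coarse-scale scaffold motif.
fine = (["[benzene]", "[C(=O)O]", "[OC(C)=O]"], {0: [1, 2], 1: [0], 2: [0]}, 0)
coarse = (["[scaffold]"], {0: []}, 0)
print(cams_sequence([fine, coarse]))
# ['<bos>', '[benzene]', '[C(=O)O]', '<link:0>', '[OC(C)=O]', '<link:0>',
#  '<sep>', '[scaffold]', '<sep>']
```

Under these assumptions, each edge of the motif graph is emitted exactly once (from its later-indexed endpoint), and placing the fine scale before the coarse scale means the causal decoder reads local structure before global scaffolds, matching the core-to-periphery, fine-to-coarse ordering described in the abstract.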