Drug discovery demands accurate molecular property prediction under limited labeled data and vast candidate spaces. This article presents CardinalGraphFormer, a graph transformer that augments structured attention with a query-conditioned, gated, unnormalized aggregation channel to preserve dynamic cardinality signals, complemented by graph-specific structural biases; a locality prior via sparse masking provides scalability for larger graphs. For typical drug-like molecules, where a masking radius of K = 3 is already near-global, masking acts mainly as a regularizer; for larger graphs it provides meaningful efficiency gains. Pretraining unifies contrastive alignment of augmented graph views with masked reconstruction of node and edge attributes. Evaluations on public benchmarks show consistent gains over baselines, with improvements isolated through controls for model capacity, training objectives, and size effects. Ablations confirm the cardinality channel's contributions beyond simpler approximations, along with efficiency benefits on large molecules. Code, artifacts, and protocols are released with an emphasis on reproducibility.
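The key intuition behind the cardinality channel can be sketched briefly. Softmax attention normalizes its weights to sum to 1, so the output of a node with 3 identical neighbors is indistinguishable from that of a node with 30; an unnormalized, sigmoid-gated sum instead lets output magnitude scale with neighborhood size. The sketch below is illustrative only and uses assumed names (`cardinality_channel`, gate projection `Wg`), not the paper's actual implementation:

```python
import numpy as np

def cardinality_channel(q, K, V, Wg):
    """Query-conditioned gated unnormalized aggregation (illustrative sketch).

    q:  (d,)   query vector for the receiving node
    K:  (n, d) key vectors of the n contributing neighbors
    V:  (n, d) value vectors of those neighbors
    Wg: (d, d) assumed gate projection matrix

    Each neighbor receives an independent sigmoid gate in (0, 1) based on
    its compatibility with the query; gated values are summed WITHOUT
    normalization, so the result grows with neighbor count (cardinality),
    a signal that softmax attention discards.
    """
    scores = K @ (Wg @ q)                   # (n,) query-conditioned gate logits
    gates = 1.0 / (1.0 + np.exp(-scores))   # per-neighbor sigmoid gates
    return gates @ V                        # unnormalized sum over neighbors

# Doubling the number of (identical) neighbors doubles the output magnitude,
# whereas softmax-normalized attention would return the same vector.
q = np.ones(4)
Wg = np.eye(4)
out3 = cardinality_channel(q, np.ones((3, 4)), np.ones((3, 4)), Wg)
out6 = cardinality_channel(q, np.ones((6, 4)), np.ones((6, 4)), Wg)
print(np.allclose(out6, 2.0 * out3))  # True: cardinality is preserved
```

In a full model this channel would run alongside the normalized attention path and be mixed back in by a learned gate; the sketch isolates only the cardinality-preserving aggregation itself.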