Drug discovery motivates efficient molecular property prediction under limited labeled data. Chemical space is vast, with an often-cited estimate of roughly 10^60 drug-like molecules, yet only a few thousand drugs have ever been approved. Self-supervised pretraining on large unlabeled molecular corpora has therefore become essential for data-efficient molecular representation learning. We introduce **CardinalGraphFormer**, a graph transformer that combines Graphormer-inspired structural biases (shortest-path distance and centrality encodings, plus a direct-bond edge bias) with structured sparse attention restricted to node pairs at shortest-path distance ≤ 3. The model augments this design with a cardinality-preserving, unnormalized aggregation channel over the same support set. Pretraining combines contrastive graph-level alignment with masked attribute reconstruction. Under a fully matched evaluation protocol against strong reproduced baselines, CardinalGraphFormer improves mean performance on all 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET, with statistically significant gains on 10 of the 11.
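To make the attention pattern concrete, the sketch below shows one plausible layer under the assumptions stated in the abstract: learned per-head biases indexed by shortest-path distance (SPD), an additive bias on directly bonded pairs, attention masked to the SPD ≤ 3 support set, and an unnormalized sum aggregation over that same support fused back via a gate. This is an illustrative sketch only, not the authors' implementation; the class name `SparseBiasedAttention`, the `card_proj`/`gate` modules, and the sigmoid fusion are all hypothetical choices.

```python
# Minimal sketch (not the authors' code) of sparse biased attention with a
# cardinality-preserving channel. All module names and fusion details are
# assumptions made for illustration.
import torch
import torch.nn as nn

class SparseBiasedAttention(nn.Module):
    def __init__(self, dim, num_heads=8, max_spd=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim, self.max_spd = num_heads, dim // num_heads, max_spd
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head bias for each shortest-path distance 0..max_spd.
        self.spd_bias = nn.Embedding(max_spd + 1, num_heads)
        # Extra additive per-head bias applied only to directly bonded pairs.
        self.bond_bias = nn.Parameter(torch.zeros(num_heads))
        # Cardinality channel: unnormalized sum over the SPD <= max_spd support,
        # retaining neighborhood-size information that softmax divides out.
        self.card_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x, spd, bonded):
        # x: (N, dim) node features; spd: (N, N) long SPD matrix (diag = 0);
        # bonded: (N, N) bool, True where two atoms share a bond.
        N, dim = x.shape
        qkv = self.qkv(x).reshape(N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=1)                               # each (N, H, Dh)
        scores = torch.einsum('ihd,jhd->hij', q, k) / self.head_dim ** 0.5
        support = spd <= self.max_spd                             # sparse support set
        scores = scores + self.spd_bias(spd.clamp(max=self.max_spd)).permute(2, 0, 1)
        scores = scores + bonded.float() * self.bond_bias[:, None, None]
        attn = scores.masked_fill(~support, float('-inf')).softmax(dim=-1)
        out = torch.einsum('hij,jhd->ihd', attn, v).reshape(N, dim)
        # Unnormalized aggregation over the same support: no division by |N(i)|.
        card = support.float() @ self.card_proj(x)
        g = torch.sigmoid(self.gate(torch.cat([out, card], dim=-1)))
        return self.proj(g * out + (1 - g) * card)

# Toy usage on random inputs (distances and bonds are synthetic):
N, dim = 12, 64
x = torch.randn(N, dim)
spd = torch.randint(0, 6, (N, N))
spd.fill_diagonal_(0)              # a node is at distance 0 from itself
bonded = spd == 1                  # treat distance-1 pairs as direct bonds
h = SparseBiasedAttention(dim)(x, spd, bonded)   # (N, dim) updated features
```

In this reading, the gated fusion is one simple way to let the model trade off softmax-normalized attention against the raw (cardinality-aware) sum; the paper may combine the two channels differently.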