Although recent advances in higher-order Graph Neural Networks (GNNs) improve theoretical expressiveness and molecular property prediction performance, they often fall short of the empirical performance of models that explicitly use fragment information as an inductive bias. For these fragment-biased approaches, however, no theoretical expressivity study exists. In this work, we propose the Fragment-WL test, an extension of the well-known Weisfeiler & Leman (WL) test, which enables the theoretical analysis of these fragment-biased GNNs. Building on the insights gained from the Fragment-WL test, we develop a new GNN architecture and a fragmentation with infinite vocabulary that significantly boosts expressiveness. We show the effectiveness of our model on synthetic and real-world data: we outperform all GNNs on Peptides and achieve 12% lower error than all GNNs on ZINC and 34% lower error than other fragment-biased models. Furthermore, we show that our model generalizes better than the latest transformer-based architectures, positioning it as a robust solution for a range of molecular modeling tasks.
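For readers unfamiliar with the WL test mentioned above, the following is a minimal sketch of the classic 1-dimensional WL color refinement, not the Fragment-WL test proposed in this work. The function name `wl_histograms`, the adjacency-dictionary input format, and the fixed number of rounds are assumptions made for illustration only.

```python
from collections import Counter

def wl_histograms(graphs, rounds=3):
    """Run classic 1-WL color refinement jointly on several graphs.

    graphs: list of (adj, labels) pairs, where adj maps each node to a list
    of its neighbors and labels maps each node to an initial label.
    Returns one color histogram (Counter) per graph. Different histograms
    prove the graphs are non-isomorphic; equal histograms prove nothing.
    """
    colors = [dict(labels) for _, labels in graphs]
    for _ in range(rounds):
        palette = {}  # shared across graphs so colors stay comparable
        refined = []
        for (adj, _), col in zip(graphs, colors):
            new = {}
            for v, nbrs in adj.items():
                # Signature: own color plus the sorted multiset of neighbor colors.
                sig = (col[v], tuple(sorted(col[u] for u in nbrs)))
                new[v] = palette.setdefault(sig, len(palette))
            refined.append(new)
        colors = refined
    return [Counter(c.values()) for c in colors]

# Illustrative inputs: one 6-cycle versus two disjoint triangles.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
uniform = {i: 0 for i in range(6)}

h_hex, h_tri = wl_histograms([(hexagon, uniform), (two_triangles, uniform)])
print(h_hex == h_tri)
```

Every node in both graphs has the same label and exactly two neighbors, so the refinement never separates them, even though one graph is a single six-membered ring and the other is two three-membered rings. This well-known limitation of 1-WL is the kind of case where ring or fragment information can plausibly help.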