Although graph neural networks have achieved great success in molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings remains under-explored. Unlike existing methods that learn discriminative representations for prediction, we propose a generative model with semantic-component identifiability, named SCI. We show that the latent variables in this generative model can be explicitly separated into semantic-relevant (SR) and semantic-irrelevant (SI) components, which improves OOD generalization by exploiting the minimal-change property of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Then, to reduce misidentification, we constrain the SR atom variables to change minimally and add a semantic latent substructure regularization that reduces the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the component-wise identifiability of the SR atom variables. Experiments on 21 datasets from 3 mainstream benchmarks show state-of-the-art performance and general improvement. Moreover, visualizations of the proposed SCI method provide insightful case studies and explanations for the prediction results. The code is available at: https://github.com/DMIRLAB-Group/SCI.
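To make the regularization idea above concrete, the following is a minimal sketch of a variance penalty on the SR-substructure embedding across augmented views of one molecule. It is an illustrative assumption, not the authors' implementation: the function name `sr_variance_penalty`, the list-of-lists embedding format, and the choice of mean per-dimension variance are all hypothetical, and the actual loss is defined in the paper and the linked repository.

```python
def sr_variance_penalty(views):
    """Mean per-dimension variance of an embedding across augmented views.

    views: list of K embeddings (each a list of D floats), one per
    augmented view of the same molecule. Hypothetical format, chosen
    only for illustration.
    """
    if len(views) < 2:
        # With fewer than two views there is nothing to compare.
        return 0.0
    dims = len(views[0])
    total = 0.0
    for d in range(dims):
        column = [v[d] for v in views]
        mean = sum(column) / len(column)
        # Population variance of dimension d across the K views.
        total += sum((x - mean) ** 2 for x in column) / len(column)
    return total / dims


# Identical views give zero penalty; views that differ are penalized.
stable = sr_variance_penalty([[1.0, 2.0], [1.0, 2.0]])
drifting = sr_variance_penalty([[0.0, 0.0], [2.0, 4.0]])
```

A penalty of this shape is small when the SR-substructure representation stays the same under augmentation and grows as it drifts, which is the behavior the abstract attributes to the semantic latent substructure regularization.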