Out-of-distribution (OOD) generalization in the graph domain is challenging because of complex distribution shifts and the lack of environmental context. Recent methods attempt to improve graph OOD generalization by generating flat environments. However, flat environments are inherently limited in their ability to capture more complex data distributions. For example, the DrugOOD dataset contains diverse training environments (e.g., scaffold and size), and flat contexts cannot sufficiently address its high heterogeneity. This poses a new challenge: generating more semantically enriched environments to enhance graph invariant learning under distribution shifts. In this paper, we propose a novel approach that generates hierarchical semantic environments for each graph. First, given an input graph, we explicitly extract variant subgraphs from it to generate proxy predictions on local environments. Then, stochastic attention mechanisms re-extract the subgraphs to regenerate global environments in a hierarchical manner. In addition, we introduce a new learning objective that guides the model to learn the diversity of environments within the same hierarchy while maintaining consistency across different hierarchies. This approach enables the model to consider the relationships between environments and facilitates robust graph invariant learning. Extensive experiments on real-world graph data demonstrate the effectiveness of our framework. In particular, on the challenging DrugOOD dataset, our method achieves improvements of up to 1.29% and 2.83% over the best baselines on the IC50 and EC50 prediction tasks, respectively.
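The idea of rewarding diversity within one hierarchy level while keeping levels consistent with each other can be illustrated with a toy example. The following is a minimal sketch and not the formulation used in this work: the abstract does not state the loss, so the entropy-based diversity term, the conditional-entropy consistency term, the `weight` parameter, and all function names are assumptions made for illustration. Environment assignments are taken to be row-stochastic matrices with one row per graph and one column per environment.

```python
import numpy as np


def diversity_term(assign, eps=1e-12):
    """Entropy of the batch-averaged assignment at one hierarchy level.

    Assumed proxy for "diversity": the value is larger when the
    environments at this level are used more evenly across the batch.
    """
    usage = assign.mean(axis=0)  # shape: (num_envs,)
    return float(-(usage * np.log(usage + eps)).sum())


def consistency_term(local_assign, global_assign, eps=1e-12):
    """Conditional entropy H(global | local) estimated over the batch.

    Assumed proxy for "consistency": the value is close to zero when
    every local environment maps into a single global environment.
    """
    n = local_assign.shape[0]
    joint = local_assign.T @ global_assign / n  # (K_local, K_global), sums to 1
    local_marginal = joint.sum(axis=1, keepdims=True)
    conditional = joint / (local_marginal + eps)
    return float(-(joint * np.log(conditional + eps)).sum())


def hierarchical_objective(local_assign, global_assign, weight=1.0):
    """Hypothetical loss: penalize inconsistency, reward per-level diversity."""
    diversity = diversity_term(local_assign) + diversity_term(global_assign)
    return consistency_term(local_assign, global_assign) - weight * diversity


# Eight graphs, four local environments, two global environments.
local = np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3]]
# Nested hierarchy: local environments {0, 1} -> global 0, {2, 3} -> global 1.
nested_global = np.eye(2)[[0, 0, 1, 1, 0, 0, 1, 1]]
# Non-nested: each local environment is split across both global environments.
mixed_global = np.eye(2)[[0, 0, 1, 1, 1, 1, 0, 0]]

loss_nested = hierarchical_objective(local, nested_global)
loss_mixed = hierarchical_objective(local, mixed_global)
print(loss_nested < loss_mixed)
```

Under these assumptions, both global assignments use the two global environments equally often, so the diversity terms are identical and the difference in loss comes only from the consistency term, which favors the nested hierarchy.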