Graph neural networks (GNNs) have achieved remarkable performance on graph-structured data. However, GNNs may inherit prejudice from the training data and make discriminatory predictions based on sensitive attributes, such as gender and race. Interest in ensuring fairness in GNNs has grown recently, but existing methods all assume that the training and testing data follow the same distribution, i.e., that they come from the same graph. Will graph fairness performance decrease under distribution shifts? How do distribution shifts affect graph fairness learning? These open questions remain largely unexplored from a theoretical perspective. To answer them, we first theoretically identify the factors that determine bias on a graph. We then explore the factors influencing fairness on testing graphs, a noteworthy one being the representation distances of certain groups between the training and testing graphs. Motivated by our theoretical analysis, we propose our framework FatraGNN. Specifically, to guarantee fairness performance on unknown testing graphs, we propose a graph generator that produces numerous graphs with significant bias and under different distributions. We then minimize the representation distance of each such group between the training graph and the generated graphs. This enables our model to achieve high classification and fairness performance even on generated graphs with significant bias, and thereby to handle unknown testing graphs effectively. Experiments on real-world and semi-synthetic datasets demonstrate the effectiveness of our model in terms of both accuracy and fairness.
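To make the idea of a group-wise representation distance concrete, the following is a minimal sketch in plain Python, not the FatraGNN implementation. It rests on assumptions the abstract does not state: groups are taken to be (label, sensitive attribute) pairs, the distance is the Euclidean distance between group mean embeddings, and the function names `group_means`, `group_alignment_loss`, and `statistical_parity_gap` are hypothetical. The last function shows one common fairness measure (the statistical parity gap) that such a model could be evaluated with.

```python
from collections import defaultdict
from math import sqrt


def group_means(embeddings, labels, sensitive):
    """Mean embedding per (label, sensitive attribute) group (assumed grouping)."""
    sums = {}
    counts = defaultdict(int)
    for vec, y, s in zip(embeddings, labels, sensitive):
        key = (y, s)
        if key not in sums:
            sums[key] = [0.0] * len(vec)
        sums[key] = [a + b for a, b in zip(sums[key], vec)]
        counts[key] += 1
    return {k: [v / counts[k] for v in total] for k, total in sums.items()}


def group_alignment_loss(train, generated):
    """Sum of Euclidean distances between matching group means.

    `train` and `generated` are (embeddings, labels, sensitive) triples.
    Groups that are missing from either graph are skipped.
    """
    mu_train = group_means(*train)
    mu_gen = group_means(*generated)
    shared = mu_train.keys() & mu_gen.keys()
    return sum(
        sqrt(sum((a - b) ** 2 for a, b in zip(mu_train[k], mu_gen[k])))
        for k in shared
    )


def statistical_parity_gap(predictions, sensitive):
    """|P(pred = 1 | s = 0) - P(pred = 1 | s = 1)| for binary predictions.

    Assumes both sensitive groups (0 and 1) contain at least one node.
    """
    rates = []
    for group in (0, 1):
        preds = [p for p, s in zip(predictions, sensitive) if s == group]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])


# Toy data: node embeddings, labels, and sensitive attributes for two graphs.
train_graph = ([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]], [0, 0, 1], [0, 0, 1])
generated_graph = ([[4.0, 4.0], [1.0, 1.0]], [0, 1], [0, 1])

loss = group_alignment_loss(train_graph, generated_graph)
gap = statistical_parity_gap([1, 0, 1, 1], [0, 0, 1, 1])
```

In a training loop, a term like `loss` would be added to the classification objective and minimized over many generated graphs; here it is computed once on toy data only to show the quantity involved.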