GraphGuard: Detecting and Counteracting Training Data Misuse in Graph Neural Networks

The adoption of Graph Neural Networks (GNNs) for graph data analysis, and their deployment on Machine Learning as a Service (MLaaS) platforms, has raised critical concerns about data misuse during model training. The problem is exacerbated by the lack of transparency in local training processes, which can allow large volumes of graph data to be accumulated without authorization and thereby infringe on the intellectual property rights of data owners. Existing methods typically address either misuse detection or mitigation, but not both, and are designed primarily for local GNN models rather than cloud-based MLaaS platforms. These limitations call for a comprehensive solution that detects and mitigates data misuse without requiring the exact training data, while respecting the proprietary nature of that data. This paper introduces GraphGuard to address these challenges. We propose a training-data-free method that detects graph data misuse and mitigates its impact through targeted unlearning, without relying on the original training data. Our misuse detection technique employs membership inference with radioactive data, which enhances the distinguishability between the member and non-member data distributions. For mitigation, we use synthetic graphs that emulate the characteristics previously learned by the target model, enabling effective unlearning even when the exact graph data is unavailable. We conduct comprehensive experiments on four real-world graph datasets to demonstrate the efficacy of GraphGuard in both detection and unlearning. GraphGuard attains a detection rate of approximately 100% across these datasets with various GNN models. In addition, it performs unlearning by eliminating the impact of the unlearned graph with a marginal decrease in accuracy (less than 5%).
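
To make the detection idea concrete, the following is a minimal sketch of a generic confidence-gap membership test. It is not the GraphGuard procedure, which the abstract does not specify in detail. It assumes only that a model tends to predict its training (member) nodes more confidently than unseen (non-member) nodes, and that the data owner can query class probabilities from the deployed model. The function names, the threshold value, and the probability arrays are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def max_confidence(probs):
    """Return the highest class probability for each queried node."""
    return np.asarray(probs, dtype=float).max(axis=1)


def detect_misuse(suspect_probs, reference_probs, threshold=0.1):
    """Flag possible misuse when the suspect nodes are predicted with
    noticeably higher average confidence than known non-member nodes.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    gap = max_confidence(suspect_probs).mean() - max_confidence(reference_probs).mean()
    return bool(gap > threshold), float(gap)


# Hypothetical posteriors returned by a target model (rows sum to 1).
suspect = np.array([[0.97, 0.02, 0.01],
                    [0.05, 0.93, 0.02],
                    [0.01, 0.03, 0.96]])    # owner's nodes, possibly used in training
reference = np.array([[0.50, 0.30, 0.20],
                      [0.40, 0.35, 0.25],
                      [0.30, 0.45, 0.25]])  # nodes known to be outside the training set

flagged, gap = detect_misuse(suspect, reference)
print(flagged, round(gap, 3))
```

The wider the gap between the two confidence distributions, the easier this decision becomes. According to the abstract, that is the role radioactive data plays in GraphGuard: it enhances the distinguishability between the member and non-member distributions.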
