Graph neural networks (GNNs) have recently emerged as a promising paradigm for learning from graph-structured data and have demonstrated wide success across domains such as recommendation systems, social networks, and electronic design automation (EDA). Like other deep learning (DL) methods, GNNs are being deployed on sophisticated modern hardware systems as well as on dedicated accelerators. However, despite the popularity of GNNs and recent efforts to bring them to hardware, their fault tolerance and resilience have largely been overlooked. Inspired by the inherent algorithmic resilience of DL methods, this paper conducts, for the first time, a large-scale empirical study of GNN resilience, aiming to understand the relationship between hardware faults and GNN accuracy. Using a customized fault injection tool built on top of PyTorch, we perform extensive fault injection experiments on a variety of GNN models and application datasets. We observe that the error resilience of GNN models varies by orders of magnitude across models and application datasets. Further, we explore a low-cost error mitigation mechanism that enhances GNN resilience. This study aims to open up new directions and opportunities for future GNN accelerator design and architectural optimization.
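To make the notion of a hardware fault more concrete, the following is a minimal sketch of one common way to emulate a transient fault in software: flipping a single bit in the IEEE-754 float32 encoding of a weight. The abstract does not state the fault model or the interface of the PyTorch-based tool, so the single-bit-flip model and the helper names `flip_bit_float32` and `inject_random_fault` are assumptions made purely for illustration, not the paper's implementation.

```python
import random
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, encoded as IEEE-754 float32, with one bit flipped.

    Hypothetical helper: bit 31 is the sign bit, bits 30..23 the exponent,
    and bits 22..0 the mantissa.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 31]")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    as_int ^= 1 << bit  # the emulated single-bit fault
    # Reinterpret the corrupted pattern as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int))
    return flipped


def inject_random_fault(weights, rng):
    """Flip one random bit in one randomly chosen weight (assumed fault model).

    Returns a corrupted copy of `weights` together with the chosen index and
    bit position, so that an experiment can log where the fault landed.
    """
    idx = rng.randrange(len(weights))
    bit = rng.randrange(32)
    faulty = list(weights)
    faulty[idx] = flip_bit_float32(weights[idx], bit)
    return faulty, idx, bit


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a campaign can be repeated
    weights = [0.5, 1.0, -2.0]
    faulty, idx, bit = inject_random_fault(weights, rng)
    print(f"flipped bit {bit} of weight {idx}: {weights[idx]} -> {faulty[idx]}")
```

Under this assumed model, the position of the flipped bit matters a great deal: flipping the sign bit of 1.0 gives -1.0, whereas flipping its highest exponent bit gives infinity. A resilience study in this style would repeat such injections many times and compare model accuracy with and without the fault.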