Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures are an attractive solution for training Graph Neural Networks (GNNs) on edge platforms. However, the immature fabrication process and limited write endurance of ReRAM devices make them prone to hardware faults, which limits their widespread adoption for GNN training. Moreover, existing fault-tolerant solutions are inadequate for training GNNs effectively in the presence of faults. In this paper, we propose FARe, a fault-aware framework that mitigates the effect of faults during GNN training. FARe outperforms existing approaches in terms of both accuracy and timing overhead. Experimental results demonstrate that the FARe framework can restore GNN test accuracy by 47.6% on faulty ReRAM hardware with a ~1% timing overhead compared to the fault-free counterpart.