Hardware failures are a growing challenge for machine learning accelerators, many of which are based on systolic arrays. When a permanent hardware fault occurs in a systolic array, existing solutions localize and isolate the faulty processing element (PE), re-execute on a redundant PE, or, in extreme cases, decommission the entire accelerator for further investigation. In this paper, we propose novel algorithmic approaches that mitigate permanent hardware faults in neural network (NN) accelerators by integrating the behavior of the faulty component instead of bypassing it. In doing so, we aim for a more sustainable use of the accelerator, in which faulty hardware is neither bypassed nor discarded but given a second life. We first introduce a CUDA-accelerated systolic array simulator in PyTorch, which enables us to quantify the impact of permanent faults appearing on links connecting two PEs or in weight registers, where one bit is stuck at 0 or 1 in the float32, float16, or bfloat16 representation. We then propose several algorithmic mitigation techniques for a subset of stuck-at faults, such as Invertible Scaling or Shifting of activations and weights, or fine-tuning with the faulty behavior. Notably, the proposed techniques require no hardware modification, relying instead on existing components of widely used systolic-array-based accelerators, such as normalization, activation, and storage units. Extensive experimental evaluations with fully connected and convolutional NNs trained on MNIST, CIFAR-10, and ImageNet show that the proposed fault-tolerant approach matches or comes very close to the original fault-free accuracy.
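To make the two central ideas concrete, the following is a minimal sketch, not taken from the paper: `inject_stuck_at` models a permanent stuck-at fault on one bit of an IEEE-754 float32 encoding (as in the simulator's fault model for weight registers), and `scaled_matmul` is a hypothetical illustration of why a power-of-two scaling of weights is invertible, since the scaling can be undone exactly after the matrix multiply. Both function names and the choice of a power-of-two factor are assumptions for illustration.

```python
import struct

import numpy as np


def inject_stuck_at(value: float, bit: int, stuck: int) -> float:
    """Force bit `bit` of the IEEE-754 float32 encoding of `value`
    to `stuck` (0 or 1), modeling a permanent stuck-at fault.
    Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the mantissa."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    if stuck:
        bits |= 1 << bit       # stuck-at-1
    else:
        bits &= ~(1 << bit)    # stuck-at-0
    (faulty,) = struct.unpack("<f", struct.pack("<I", bits))
    return faulty


# A stuck-at-1 fault on a high exponent bit is catastrophic ...
print(inject_stuck_at(1.0, 30, 1))  # → inf
# ... while the same fault on the mantissa LSB barely perturbs the value.
print(inject_stuck_at(1.0, 0, 1))


def scaled_matmul(x: np.ndarray, w: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical invertible-scaling sketch: pre-scale the weights by
    2**k before they would be loaded into the faulty array, then undo
    the scaling on readout. Mathematically the result is unchanged,
    but the scaled weights can be steered into a value range that is
    compatible with a given stuck bit."""
    w_scaled = w * (2.0 ** k)   # weights as loaded into the array
    y = x @ w_scaled            # systolic-array matmul (simulated here)
    return y / (2.0 ** k)       # invert the scaling afterwards
```

Because the scale factor is a power of two, scaling and unscaling only shift the floating-point exponent and introduce no rounding error, which is what makes the transformation exactly invertible.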