Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining incurs substantial overhead, especially when used to fine-tune large DNNs designed for complex problems. Moreover, because each fabricated chip can have a distinct fault pattern, fault-aware retraining must be performed for each chip individually using its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that first computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on this resilience, it computes the amount of retraining required for each chip given its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array.
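The two-step idea described above (profile resilience once, then pick a retraining amount per chip) can be illustrated with a minimal sketch. This is not the Reduce implementation; the abstract does not specify how the selection is made. The table `RESILIENCE`, the functions `fault_rate` and `select_retraining_amount`, the use of epochs as the unit of retraining, the accuracy target, and all numbers are hypothetical placeholders chosen for illustration only.

```python
from bisect import bisect_left

# Hypothetical resilience profile: fault rate -> {retraining epochs: accuracy}.
# The values are made-up placeholders, not results from the paper.
RESILIENCE = {
    0.00: {0: 0.92, 1: 0.92, 2: 0.92},
    0.05: {0: 0.80, 1: 0.90, 2: 0.92},
    0.10: {0: 0.60, 1: 0.85, 2: 0.91},
}


def fault_rate(fault_map):
    """Fraction of faulty processing elements in a 2D boolean fault map."""
    cells = [cell for row in fault_map for cell in row]
    return sum(cells) / len(cells)


def select_retraining_amount(fault_map, target_accuracy):
    """Return the smallest profiled retraining amount (in epochs) expected
    to reach target_accuracy, or None if the profile cannot answer."""
    rate = fault_rate(fault_map)
    profiled_rates = sorted(RESILIENCE)
    # Assumption: round up to the next profiled fault rate to stay conservative.
    idx = bisect_left(profiled_rates, rate)
    if idx == len(profiled_rates):
        return None  # chip is faultier than anything that was profiled
    profile = RESILIENCE[profiled_rates[idx]]
    for epochs in sorted(profile):
        if profile[epochs] >= target_accuracy:
            return epochs
    return None  # no profiled amount reaches the target


# Usage: a 4x5 array with one faulty element has a fault rate of 0.05.
chip_a = [[False] * 5 for _ in range(4)]
chip_a[2][3] = True
epochs_for_chip_a = select_retraining_amount(chip_a, target_accuracy=0.90)
```

Under these assumptions, the profiling cost is paid once per DNN, and each chip then needs only a table lookup before retraining, so lightly faulted chips are not retrained more than their fault maps require.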