eFAT: Improving the Effectiveness of Fault-Aware Training for Mitigating Permanent Faults in DNN Hardware Accelerators

Fault-Aware Training (FAT) has emerged as a highly effective technique for addressing permanent faults in DNN accelerators, as it mitigates faults without significant performance or accuracy loss, particularly at low and moderate fault rates. However, it incurs very high retraining overheads, especially for large DNNs designed for complex AI applications. Moreover, because each fabricated chip can have a distinct fault pattern, FAT must be performed for each faulty chip individually using that chip's unique fault map, which further aggravates the problem. To reduce the overheads of FAT while maintaining its benefits, we propose (1) resilience-driven retraining amount selection, and (2) resilience-driven grouping and fusion of multiple fault maps (belonging to different chips) to perform consolidated retraining for a group of faulty chips. To realize these concepts, we present a novel framework, eFAT, that computes the resilience of a given DNN to faults at different fault rates and with different levels of retraining, and uses that knowledge to build a resilience map for a user-defined accuracy constraint. It then uses the resilience map to compute the amount of retraining required for each chip, considering its unique fault map. Afterward, it performs resilience- and reward-driven grouping and fusion of fault maps to further reduce the number of retraining iterations required to tune the given DNN for the given set of faulty chips. We demonstrate the effectiveness of our framework for a systolic array-based DNN accelerator experiencing permanent faults in the computational array. Our extensive results for numerous chips show that the proposed technique significantly reduces the retraining cost when tuning a DNN for multiple faulty chips.
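The two concepts above can be illustrated with a minimal Python sketch. It is not the eFAT implementation: the abstract does not specify the data structures or the grouping algorithm, so everything here is an assumption made for illustration. A fault map is modeled as a set of faulty processing-element coordinates, fusion as a set union, the resilience map as a hypothetical table of (maximum fault rate, retraining epochs) pairs with made-up values, and the reward as the number of epochs saved by retraining once for a fused map instead of once per group.

```python
from itertools import combinations

# Hypothetical resilience map: (max fault rate, retraining epochs assumed
# sufficient to meet the accuracy constraint). Values are illustrative only.
RESILIENCE_MAP = [(0.01, 0), (0.05, 1), (0.10, 3), (0.20, 8)]


def fault_rate(fault_map, rows, cols):
    """Fraction of processing elements (PEs) marked faulty."""
    return len(fault_map) / (rows * cols)


def retraining_amount(rate, resilience_map=RESILIENCE_MAP):
    """Smallest retraining amount whose fault-rate bound covers `rate`.

    Returns None when the rate exceeds every entry in the map.
    """
    for max_rate, epochs in resilience_map:
        if rate <= max_rate:
            return epochs
    return None


def fuse(map_a, map_b):
    """Fused fault map: a PE is faulty if it is faulty in either map."""
    return map_a | map_b


def greedy_group(fault_maps, rows, cols):
    """Greedily merge groups while a merge saves retraining epochs.

    Assumed reward: cost(a) + cost(b) - cost(fused). This is a stand-in
    for the paper's reward, which the abstract does not define.
    """
    def cost(fm):
        return retraining_amount(fault_rate(fm, rows, cols))

    # Each group is (chip indices, fused fault map of those chips).
    groups = [([i], fm) for i, fm in enumerate(fault_maps)]
    while True:
        best = None
        for (i, (ids_a, fm_a)), (j, (ids_b, fm_b)) in combinations(
            enumerate(groups), 2
        ):
            fused = fuse(fm_a, fm_b)
            costs = (cost(fm_a), cost(fm_b), cost(fused))
            if None in costs:
                continue  # outside the resilience map: do not merge
            reward = costs[0] + costs[1] - costs[2]
            if reward > 0 and (best is None or reward > best[0]):
                best = (reward, i, j, ids_a + ids_b, fused)
        if best is None:
            return groups
        _, i, j, ids, fused = best
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append((ids, fused))


if __name__ == "__main__":
    # Three hypothetical chips with a 10x10 systolic array each.
    chip_a = {(0, 0), (0, 1), (1, 1)}
    chip_b = {(0, 0), (0, 1), (2, 2)}
    chip_c = {(5, k) for k in range(9)}
    for ids, fused in greedy_group([chip_a, chip_b, chip_c], 10, 10):
        print(ids, retraining_amount(fault_rate(fused, 10, 10)))
```

In this toy setup, two chips whose fault maps overlap heavily are fused because the fused map stays within the same resilience band, while a chip with many disjoint faults is kept separate because fusing it would push the group into a band that needs more retraining than the chips need individually.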
