Deep Learning (DL) systems have proliferated in many applications, requiring specialized hardware accelerators and chips. In the nano-era, devices have become increasingly susceptible to permanent and transient faults. Therefore, we need an efficient methodology for analyzing the resilience of advanced DL systems against such faults and for understanding how faults in neural accelerator chips manifest as errors at the DL application level, where they can lead to undetectable and unrecoverable errors. Using fault injection, we can investigate the resilience of a DL system by modifying neuron weights and outputs at the software level, as if the hardware had been affected by a transient fault. Existing fault models reduce the search space, allowing faster analysis, but they require a priori knowledge of the model and do not allow further analysis of the filtered-out search space. Therefore, we propose ISimDL, a novel methodology that employs neuron sensitivity to generate importance sampling-based fault scenarios. Without any a priori knowledge of the model under test, ISimDL provides a reduction of the search space equivalent to that of existing works, while allowing long simulations to cover all possible faults, improving on the model requirements of existing approaches. Our experiments show that importance sampling provides up to 15x higher precision in selecting critical faults than random uniform sampling, reaching such precision in fewer than 100 faults. Additionally, we showcase another practical use case of importance sampling for reliable DNN design, namely Fault Aware Training (FAT). By using ISimDL to select the faults leading to errors, we can insert the faults during the DNN training process to harden the DNN against such faults. Using importance sampling in FAT reduces the overhead required for finding faults that lead to a predetermined drop in accuracy by more than 12x.
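To make the two ideas in the abstract concrete, the following minimal Python sketch shows (a) a software-level transient fault modeled as a single bit flip in a float32 weight, and (b) drawing fault sites with probability proportional to a per-neuron sensitivity score instead of uniformly. It is an illustration under stated assumptions, not the ISimDL implementation: the function names, the toy sensitivity scores, and the "critical" set are all hypothetical, and the sketch covers only the biased sampling step, without the likelihood-ratio reweighting that a full importance-sampling estimator would apply.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) in the float32 encoding of `value`."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def sample_fault_sites(sensitivity, n_faults, rng, uniform=False):
    """Draw neuron indices uniformly, or proportionally to their sensitivity."""
    sites = range(len(sensitivity))
    weights = None if uniform else sensitivity
    return rng.choices(sites, weights=weights, k=n_faults)


def precision(samples, critical):
    """Fraction of sampled fault sites that fall in the critical set."""
    return sum(s in critical for s in samples) / len(samples)


rng = random.Random(0)

# Hypothetical sensitivity scores: 5 of 100 neurons are highly sensitive.
sensitivity = [10.0 if i < 5 else 0.1 for i in range(100)]
# Toy ground truth: faults in those 5 neurons are the "critical" ones.
critical = set(range(5))

uniform_sites = sample_fault_sites(sensitivity, 100, rng, uniform=True)
weighted_sites = sample_fault_sites(sensitivity, 100, rng)

print("uniform precision: ", precision(uniform_sites, critical))
print("weighted precision:", precision(weighted_sites, critical))
print("sign-bit flip of 1.0:", flip_bit(1.0, 31))
```

In this toy setup the sensitivity-weighted sampler spends most of its fault budget on the few sensitive neurons, which is the intuition behind the precision gain reported in the abstract. The numbers it prints are properties of the made-up scores and say nothing about the paper's measured results.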