Deep Learning (DL) systems have proliferated across many applications and rely on specialized hardware accelerators and chips. In the nano-era, devices have become increasingly susceptible to permanent and transient faults. We therefore need an efficient methodology for analyzing the resilience of advanced DL systems against such faults, and for understanding how faults in neural accelerator chips manifest as errors at the DL application level, where they can lead to undetectable and unrecoverable errors. Fault injection enables such resilience investigations by modifying neuron weights and outputs at the software level, as if the hardware had been affected by a transient fault. Existing fault models reduce the search space and thus speed up the analysis, but they require a priori knowledge of the model and do not allow further analysis of the filtered-out search space. We therefore propose ISimDL, a novel methodology that employs neuron sensitivity to generate importance-sampling-based fault scenarios. Without any a priori knowledge of the model under test, ISimDL provides a reduction of the search space equivalent to that of existing works, while still allowing long simulations to cover all possible faults, improving on the model requirements of existing approaches. Our experiments show that importance sampling provides up to 15x higher precision in selecting critical faults than random uniform sampling, reaching this precision in fewer than 100 faults. Additionally, we showcase another practical use case of importance sampling for reliable Deep Neural Network (DNN) design, namely Fault Aware Training (FAT). By using ISimDL to select the faults that lead to errors, we can inject these faults during the DNN training process to harden the DNN against them. Using importance sampling in FAT reduces the overhead required for finding faults that lead to a predetermined drop in accuracy by more than 12x.
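To make the sampling idea in the abstract concrete, the following is a minimal sketch of how sensitivity scores could be turned into an importance-sampling distribution over fault locations and compared with uniform sampling. It is not the ISimDL implementation: the function names, the `floor` parameter, the synthetic sensitivity scores, and the toy rule that marks the most sensitive 2% of neurons as "critical" are all assumptions made for illustration. The abstract does not specify how neuron sensitivity is computed, so the scores here are simply random numbers.

```python
import numpy as np


def importance_distribution(sensitivity, floor=1e-6):
    """Turn sensitivity scores into sampling probabilities (illustrative).

    The small floor (an assumption of this sketch) keeps every fault
    location reachable, so a long enough campaign can still cover the
    whole search space instead of filtering part of it out.
    """
    scores = np.abs(np.asarray(sensitivity, dtype=float)) + floor
    return scores / scores.sum()


def sample_fault_sites(probabilities, n_faults, rng):
    """Draw distinct fault locations (flat neuron indices) without replacement."""
    return rng.choice(len(probabilities), size=n_faults, replace=False, p=probabilities)


def precision(selected, critical_mask):
    """Fraction of the selected fault locations that are marked critical."""
    return float(critical_mask[selected].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_neurons = 10_000

    # Hypothetical sensitivity scores; a real campaign would derive them
    # from the model under test.
    sensitivity = rng.exponential(size=n_neurons)

    # Toy ground truth (assumption): the top 2% most sensitive neurons
    # are the ones whose faults are critical.
    critical = sensitivity >= np.quantile(sensitivity, 0.98)

    weighted = importance_distribution(sensitivity)
    uniform = np.full(n_neurons, 1.0 / n_neurons)

    for name, probs in (("importance", weighted), ("uniform", uniform)):
        sites = sample_fault_sites(probs, n_faults=100, rng=rng)
        print(f"{name:>10} sampling precision: {precision(sites, critical):.3f}")
```

Because the toy ground truth is derived from the same scores that drive the sampler, the gap this script prints says nothing about the 15x figure reported above; it only shows the mechanics. In a Fault Aware Training setting, the sampled indices would be the locations at which faults are injected during training.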