Recent advances in deep learning have produced highly accurate but increasingly large and complex DNNs, making traditional fault-injection techniques impractical. Accurate fault analysis requires RTL-accurate hardware models, yet such models significantly slow evaluation compared with software-only approaches, particularly when combined with expensive HDL instrumentation. In this work, we show that such high-overhead methods are unnecessary for systolic array (SA) architectures and propose ENFOR-SA, an end-to-end framework for DNN transient fault analysis on SAs. Our two-step approach employs cross-layer simulation, using RTL SA components only during fault injection and executing the rest at the software level. Experiments on CNNs and Vision Transformers demonstrate that ENFOR-SA achieves RTL-accurate fault injection with only a 6% average slowdown compared with software-based injection, while delivering at least two orders of magnitude speedup (average $569\times$) over full-SoC RTL simulation and a $2.03\times$ improvement over a state-of-the-art cross-layer RTL injection tool. ENFOR-SA code is publicly available at https://github.com/rafaabt/ENFOR-SA.