Neural networks (NNs) are increasingly used in always-on, safety-critical applications deployed on hardware accelerators (NN-HAs) that employ various memory technologies. Reliable continuous operation of NNs is essential for such applications. During online operation, NNs are susceptible to single and multiple permanent and soft errors caused by factors such as radiation, aging, and thermal effects. Explicit NN-HA testing methods cannot detect transient faults during inference, are unsuitable for always-on applications, and require extensive test vector generation and storage. Therefore, in this paper, we propose the \emph{uncertainty fingerprint} approach, which represents the online fault status of an NN. Furthermore, we propose a dual-head NN topology specifically designed to produce the uncertainty fingerprint and the primary prediction of the NN in \emph{a single shot}. During online operation, by matching the uncertainty fingerprint, we can concurrently self-test NNs with up to $100\%$ coverage and a low false-positive rate while maintaining similar performance on the primary task. Compared to existing works, memory overhead is reduced by up to $243.7$ MB, multiply-and-accumulate (MAC) operations are reduced by up to $10000\times$, and false-positive rates are reduced by up to $89\%$.
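To make the single-shot, dual-head idea concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a tiny untrained ReLU network with random weights, a scalar linear fingerprint head, a matching rule based on a range calibrated on fault-free data, and a fault model that multiplies a fraction of first-layer weights by a large factor (loosely emulating a flipped exponent bit). All names (`init_params`, `forward`, `calibrate`, `mismatch`, `inject_fault`) and all parameter values are hypothetical.

```python
import numpy as np

def init_params(rng, n_in=16, n_hidden=32, n_classes=4):
    """Random weights for a shared trunk, a primary head, and a fingerprint head."""
    return {
        "W1": rng.normal(0.0, 0.3, (n_in, n_hidden)),          # shared trunk
        "W_cls": rng.normal(0.0, 0.3, (n_hidden, n_classes)),  # primary head
        "w_fp": rng.normal(0.0, 0.3, n_hidden),                # fingerprint head (assumed scalar)
    }

def forward(params, x):
    """One forward pass that returns both the primary logits and the fingerprint."""
    h = np.maximum(x @ params["W1"], 0.0)  # shared ReLU features
    logits = h @ params["W_cls"]           # primary prediction
    fingerprint = h @ params["w_fp"]       # one scalar per input
    return logits, fingerprint

def calibrate(params, x_cal, margin=0.05):
    """Assumed matching rule: accept fingerprints inside a padded fault-free range."""
    _, fp = forward(params, x_cal)
    pad = margin * (fp.max() - fp.min())
    return fp.min() - pad, fp.max() + pad

def mismatch(fingerprint, bounds):
    """True where the fingerprint falls outside the calibrated range."""
    lo, hi = bounds
    return (fingerprint < lo) | (fingerprint > hi)

def inject_fault(params, rng, fraction=0.01, factor=1000.0):
    """Assumed fault model: scale a random fraction of trunk weights by `factor`."""
    faulty = dict(params)
    mask = rng.random(params["W1"].shape) < fraction
    faulty["W1"] = np.where(mask, params["W1"] * factor, params["W1"])
    return faulty

rng = np.random.default_rng(0)
params = init_params(rng)
x_cal = rng.normal(size=(256, 16))
bounds = calibrate(params, x_cal)

# Fault-free pass: prediction and self-test come from the same forward call.
logits, fp = forward(params, x_cal)
print("fault-free inputs flagged:", int(mismatch(fp, bounds).sum()))

# Faulty pass: the same check is applied to a corrupted copy of the weights.
_, fp_faulty = forward(inject_fault(params, rng), x_cal)
print("faulty inputs flagged:", int(mismatch(fp_faulty, bounds).sum()))
```

In this sketch the self-test adds only one extra dot product per input on top of the primary inference, which is the kind of overhead profile a single-shot scheme aims for. How the fingerprint head is actually trained and matched is not specified in the abstract, so the range check above is only a stand-in.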