Hardware failures are a growing challenge for machine learning accelerators, many of which are based on systolic arrays. When a permanent hardware fault occurs in a systolic array, existing solutions localize and isolate the faulty processing element (PE), re-execute on a redundant PE, or, in extreme cases, decommission the entire accelerator for further investigation. In this paper, we propose novel algorithmic approaches that mitigate permanent hardware faults in neural network (NN) accelerators by integrating the behavior of the faulty component instead of bypassing it. In doing so, we aim for a more sustainable use of the accelerator, in which faulty hardware is neither bypassed nor discarded but given a second life. We first introduce a CUDA-accelerated systolic array simulator in PyTorch, which enables us to quantify the impact of permanent faults appearing on links connecting two PEs or in weight registers, where one bit is stuck at 0 or 1 in the float32, float16, or bfloat16 representation. We then propose several algorithmic mitigation techniques for a subset of stuck-at faults, such as Invertible Scaling or Shifting of activations and weights, or fine-tuning with the faulty behavior. Notably, the proposed techniques require no hardware modification, relying instead on existing components of widely used systolic-array-based accelerators, such as normalization, activation, and storage units. Extensive experimental evaluations with fully connected and convolutional NNs trained on MNIST, CIFAR-10, and ImageNet show that the proposed fault-tolerant approach matches or comes very close to the original fault-free accuracy.
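To make the two central ideas concrete, the following is a minimal sketch, not taken from the paper: `inject_stuck_at` models a permanent stuck-at fault on one bit of an IEEE-754 float32 encoding (as in the simulator's fault model for weight registers), and `scaled_matmul` is a hypothetical illustration of why a power-of-two scaling of weights is invertible, since the scaling can be undone exactly after the matrix multiply. Both function names and the choice of a power-of-two factor are assumptions for illustration.

```python
import struct

import numpy as np


def inject_stuck_at(value: float, bit: int, stuck: int) -> float:
    """Force bit `bit` of the IEEE-754 float32 encoding of `value`
    to `stuck` (0 or 1), modeling a permanent stuck-at fault.
    Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the mantissa."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    if stuck:
        bits |= 1 << bit       # stuck-at-1
    else:
        bits &= ~(1 << bit)    # stuck-at-0
    (faulty,) = struct.unpack("<f", struct.pack("<I", bits))
    return faulty


# A stuck-at-1 fault on a high exponent bit is catastrophic ...
print(inject_stuck_at(1.0, 30, 1))  # → inf
# ... while the same fault on the mantissa LSB barely perturbs the value.
print(inject_stuck_at(1.0, 0, 1))


def scaled_matmul(x: np.ndarray, w: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical invertible-scaling sketch: pre-scale the weights by
    2**k before they would be loaded into the faulty array, then undo
    the scaling on readout. Mathematically the result is unchanged,
    but the scaled weights can be steered into a value range that is
    compatible with a given stuck bit."""
    w_scaled = w * (2.0 ** k)   # weights as loaded into the array
    y = x @ w_scaled            # systolic-array matmul (simulated here)
    return y / (2.0 ** k)       # invert the scaling afterwards
```

Because the scale factor is a power of two, scaling and unscaling only shift the floating-point exponent and introduce no rounding error, which is what makes the transformation exactly invertible.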