A Systematic Literature Review on Hardware Reliability Assessment Methods for Deep Neural Networks

Artificial Intelligence (AI) and, in particular, Machine Learning (ML) are used in a wide range of applications because of their capability to learn how to solve complex problems. Over the last decade, rapid advances in ML have produced Deep Neural Networks (DNNs) consisting of a large number of neurons and layers. DNN Hardware Accelerators (DHAs) are leveraged to deploy DNNs in target applications, including safety-critical applications, where hardware faults or errors would result in catastrophic consequences. The reliability of DNNs is therefore an essential subject of research. In recent years, several studies have assessed the reliability of DNNs, proposing various reliability assessment methods on a variety of platforms and applications. Hence, there is a need to summarize the state of the art and identify the gaps in the study of DNN reliability. In this work, we conduct a Systematic Literature Review (SLR) on reliability assessment methods for DNNs to collect as many relevant research works as possible, present a categorization of them, and address the open challenges. Through this SLR, we identify three kinds of methods for the reliability assessment of DNNs: Fault Injection (FI), Analytical, and Hybrid methods. Since the majority of works assess DNN reliability by FI, we comprehensively characterize the different approaches and platforms of the FI method. We also present the Analytical and Hybrid methods. The different reliability assessment methods are thus described in terms of the DNN platforms on which they are conducted and the reliability evaluation metrics they use. Finally, we highlight the advantages and disadvantages of the identified methods and address the open challenges in the research area.
