Reliability is one of the major design criteria in Cyber-Physical Systems (CPSs), because CPSs run critical applications whose failure is catastrophic. Strong error detection and correction mechanisms are therefore essential in CPSs. A CPS is composed of a variety of units, including sensors, networks, and microcontrollers. Any of these units may enter a faulty state at any time, and the resulting fault can produce erroneous output, cause the units of the CPS to malfunction, and eventually crash the system. Traditional fault-tolerant approaches rely on time, hardware, information, and/or software redundancy. However, these approaches impose significant overheads while offering low error coverage, which limits their applicability. In addition, the interval between error occurrence and detection is too long in these approaches. In this paper, we propose a new error detection approach based on Deep Reinforcement Learning (DRL) that not only detects errors with high accuracy but also, owing to its very low inference time, detects them as they occur. The proposed approach can distinguish different types of errors from normal data and predict whether the system will fail. The evaluation results show that the proposed approach improves accuracy by more than 2x and inference time by more than 5x compared to other approaches.