Reliability is one of the major design criteria in Cyber-Physical Systems (CPSs), because CPSs host critical applications whose failure is catastrophic. Strong error detection and correction mechanisms are therefore essential in CPSs. CPSs are composed of a variety of units, including sensors, networks, and microcontrollers. Any of these units may enter a faulty state at any time, and the resulting fault can produce erroneous output, cause the unit to malfunction, and eventually crash it. Traditional fault-tolerant approaches rely on time, hardware, information, and/or software redundancy. However, these approaches impose significant overheads and offer low error coverage, which limits their applicability. In addition, the interval between error occurrence and detection is too long in these approaches. In this paper, a new error detection approach based on Deep Reinforcement Learning (DRL) is proposed that not only detects errors with high accuracy but also, owing to its very low inference time, can detect them in real time. The proposed approach can distinguish different types of errors from normal data and predict whether the system will fail. The evaluation results show that the proposed approach improves accuracy by more than 2x and inference time by more than 5x compared to other approaches.