The proliferation of demanding applications and edge computing calls for efficient management of the underlying computing infrastructure and urges providers to rethink their operational methods. In this paper, we propose an Intelligent Proactive Fault Tolerance (IPFT) method that leverages edge resource usage predictions produced by Recurrent Neural Networks (RNNs). More specifically, we focus on process faults, which arise when the infrastructure cannot keep Quality of Service (QoS) within acceptable ranges because of insufficient processing power. To tackle this challenge, we propose a composite deep learning architecture that predicts the resource usage metrics of the edge nodes and triggers proactive node replication and task migration. Because the edge computing infrastructure is also highly dynamic and heterogeneous, we propose an innovative Hybrid Bayesian Evolution Strategy (HBES) algorithm for automated adaptation of the resource usage models. The proposed resource usage prediction mechanism has been experimentally evaluated and compared with other state-of-the-art methods, showing significant improvements in terms of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Additionally, the IPFT mechanism that leverages the resource usage predictions has been evaluated in an extensive CloudSim Plus simulation, and the results show significant improvement over the reactive fault tolerance method in terms of reliability and maintainability.
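As a rough illustration of the two ideas named above, the following minimal sketch computes RMSE and MAE using their standard definitions and shows a simple threshold check on predicted usage. It is not the paper's IPFT or HBES implementation: the function names, the single CPU-utilization metric, and the `threshold` value of 0.8 are all assumptions made purely for illustration.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error between two equal-length series."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error between two equal-length series."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def needs_proactive_action(predicted_cpu: Sequence[float],
                           threshold: float = 0.8) -> bool:
    """Hypothetical trigger: act before the fault if any predicted
    CPU utilization (0.0 to 1.0) reaches the assumed threshold."""
    return max(predicted_cpu) >= threshold
```

In this sketch, a `True` result from `needs_proactive_action` would stand in for the point at which a proactive scheme starts node replication or task migration, whereas a reactive scheme would wait until the overload is actually observed.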