Moving scientific computation from high-performance computing (HPC) and cloud computing (CC) environments to the edge, where streamlined computing devices collect data while physically located near the instruments of interest, has attracted tremendous interest in recent years. Such edge computing environments can operate on data in situ rather than requiring that data be collected in HPC and/or CC facilities. The benefits include avoiding the cost of transmission over potentially slow or unreliable networks, increased data privacy, and real-time data analysis. Before these benefits can be realized at scale, new fault-tolerant approaches must be developed to address the inherent unreliability of edge computing environments, because the traditional resilience approaches used in HPC and CC are not generally applicable at the edge. Those approaches commonly rely on checkpoint-and-restart and/or redundant-computation strategies, which are infeasible in edge computing environments where data storage is limited and synchronization is expensive. Motivated by prior algorithm-based fault tolerance approaches, a variant of the asynchronous Jacobi (ASJ) method is developed herein that achieves resilience to data corruption by leveraging existing convergence theory. The proposed ASJ variant rejects a solution approximation received from a neighbor device if the distance between two successive approximations violates an analytic bound. Numerical results show that the ASJ variant restores convergence in the presence of certain types of natural and malicious data corruption.
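To make the rejection idea concrete, the following is a minimal sketch and not the method evaluated in this work. It simulates devices that each own one unknown of a strictly diagonally dominant system, update in random order, and exchange values over a channel that sometimes delays or corrupts messages. The analytic bound actually used by the ASJ variant is not stated here, so the sketch substitutes a crude, assumed one: with a zero initial guess and Jacobi iteration matrix norm rho < 1 in the infinity norm, any two uncorrupted values of the same unknown differ by at most 2 * max_i |b_i / a_ii| / (1 - rho). The names `make_system`, `rejection_bound`, and `async_jacobi`, the test matrix, and the corruption model (adding 1e6) are all hypothetical choices for illustration.

```python
import random


def make_system(n):
    """Build a strictly diagonally dominant tridiagonal test system A x = b."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 3.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A, [1.0] * n


def rejection_bound(A, b):
    """Assumed stand-in bound on the jump between two accepted neighbor values.

    Valid for a zero initial guess when rho, the infinity norm of the Jacobi
    iteration matrix, is below 1 (true for strict diagonal dominance).
    """
    n = len(A)
    rho = max(
        sum(abs(A[i][j]) for j in range(n) if j != i) / abs(A[i][i])
        for i in range(n)
    )
    first_step = max(abs(b[i] / A[i][i]) for i in range(n))
    return 2.0 * first_step / (1.0 - rho)


def async_jacobi(A, b, steps, corrupt_prob, use_rejection, seed=0):
    """Simulate asynchronous Jacobi and return the final residual (infinity norm)."""
    rng = random.Random(seed)
    n = len(A)
    bound = rejection_bound(A, b)
    # local[k][j] is device k's most recently accepted copy of unknown j.
    local = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)  # the device that happens to update next
        off_diag = sum(A[i][j] * local[i][j] for j in range(n) if j != i)
        x_i = (b[i] - off_diag) / A[i][i]
        local[i][i] = x_i
        for k in range(n):
            if k == i or A[k][i] == 0.0:
                continue  # device k does not depend on unknown i
            if rng.random() < 0.2:
                continue  # message delayed: device k keeps a stale copy
            msg = x_i
            if rng.random() < corrupt_prob:
                msg = x_i + 1.0e6  # simulated data corruption in transit
            if use_rejection and abs(msg - local[k][i]) > bound:
                continue  # jump violates the bound: reject the message
            local[k][i] = msg
    x = [local[i][i] for i in range(n)]
    return max(
        abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n)
    )


if __name__ == "__main__":
    A, b = make_system(8)
    print("residual with rejection:   ", async_jacobi(A, b, 20000, 0.05, True))
    print("residual without rejection:", async_jacobi(A, b, 20000, 0.05, False))
```

In this toy setting the check only catches corruption large enough to exceed the static bound; small perturbations pass through, which is consistent with the claim that convergence is restored for certain types of corruption rather than all of them. A bound that tightens as the iteration converges would reject more, but deriving one requires the convergence theory that the work itself relies on.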