QR decomposition is an essential operation for solving linear systems and computing least-squares solutions. In high-performance computing systems, large-scale parallel QR decomposition is frequently exposed to node failures. We address this problem with a fault-tolerant algorithm that incorporates `coded computing' into the parallel Gram-Schmidt method, which is commonly used for QR decomposition. Coded computing introduces error-correcting codes into the computation itself to improve resilience against failures that occur mid-computation. Although traditional coding strategies cannot preserve the orthogonality of $Q$, recent work has established a post-orthogonalization condition under which the degraded orthogonality can be restored at low cost. In this paper, we construct a checksum-generator matrix for multiple-node failures that satisfies the post-orthogonalization condition, and we prove that the resulting code has the maximum-distance separable (MDS) property with high probability. Furthermore, we consider an in-node checksum storage setting, in which checksums are stored on the original nodes rather than on dedicated ones. We derive the minimum number of checksums required to tolerate any $f$ failures under in-node checksum storage, and we propose an in-node systematic MDS coding strategy that achieves this lower bound. Extensive experiments validate our theoretical results and show that our coded computing framework adds negligible overhead to fault-tolerant QR decomposition.
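The role of the MDS property can be illustrated with a minimal sketch. It is not the construction proposed in this paper: it uses a plain random Gaussian checksum-generator matrix, it does not enforce the post-orthogonalization condition, and it stores checksums separately rather than in-node. The names `encode`, `recover`, `X`, and `G` are hypothetical and introduced only for illustration. The sketch assumes each of $p$ nodes holds one column of `X` and that $f$ checksum columns `X @ G` are kept; any $f$ lost columns can then be recovered as long as the corresponding $f \times f$ submatrix of `G` is invertible, which holds with probability one for Gaussian entries.

```python
import itertools
import numpy as np


def encode(X, G):
    """Return the checksum columns X @ G (one column per checksum)."""
    return X @ G


def recover(X_damaged, checksums, G, failed):
    """Rebuild the columns listed in `failed` from the surviving columns
    and the checksums. Assumes len(failed) <= number of checksums."""
    failed = list(failed)
    surviving = [j for j in range(X_damaged.shape[1]) if j not in failed]
    # checksums = X[:, surviving] @ G[surviving] + X[:, failed] @ G[failed]
    residual = checksums - X_damaged[:, surviving] @ G[surviving, :]
    # Solve X_failed @ G[failed] = residual for X_failed (least squares
    # handles the case of fewer failures than checksums).
    X_failed = np.linalg.lstsq(G[failed, :].T, residual.T, rcond=None)[0].T
    X_rec = X_damaged.copy()
    X_rec[:, failed] = X_failed
    return X_rec


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, f = 5, 6, 2                      # vector length, nodes, checksums
    X = rng.standard_normal((n, p))        # column j lives on node j
    G = rng.standard_normal((p, f))        # random checksum-generator matrix
    C = encode(X, G)

    # Try every possible set of f failed nodes.
    for failed in itertools.combinations(range(p), f):
        X_damaged = X.copy()
        X_damaged[:, list(failed)] = 0.0   # simulate lost node data
        X_rec = recover(X_damaged, C, G, failed)
        print(failed, np.allclose(X_rec, X, atol=1e-8))
```

The loop checks every failure pattern of size $f$, which is what the MDS property requires: recovery must succeed for any $f$ failures, not just for some of them.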