Straggler nodes are a well-known bottleneck in distributed matrix computations, where they reduce computation and communication speeds. A common strategy for mitigating stragglers is to incorporate Reed-Solomon based MDS (maximum distance separable) codes into the framework; this can achieve resilience against an optimal number of stragglers. However, these codes assign dense linear combinations of submatrices to the worker nodes. When the input matrices are sparse, these approaches increase the number of non-zero entries in the encoded matrices, which in turn adversely affects the worker computation time. In this work, we develop a distributed matrix computation approach where the assigned encoded submatrices are random linear combinations of a small number of submatrices. In addition to being well suited for sparse input matrices, our approach continues to have the optimal straggler resilience in a certain range of problem parameters. Moreover, compared to recent sparse matrix computation approaches, the search for a "good" set of random coefficients to promote numerical stability in our method is much more computationally efficient. We show that our approach can efficiently utilize partial computations done by slower worker nodes in a heterogeneous system, which can enhance the overall computation speed. Numerical experiments conducted through Amazon Web Services (AWS) demonstrate up to a 30% reduction in per worker node computation time and 100x faster encoding compared to the available methods.
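The sparsity argument above can be illustrated with a minimal sketch. It is not the scheme proposed in this work: the block sizes, the density, the combination weight `weight`, the coefficient distribution, and all function names are assumptions chosen only for illustration, and decodability and straggler resilience are not modeled. The sketch only compares how many non-zero entries result when a sparse input is encoded as a random linear combination of all `k` submatrices (as a dense MDS-style code would do) versus a small number of them.

```python
import random


def random_sparse_block(rows, cols, density, rng):
    """Return a sparse block as {(i, j): value} with roughly `density` fill."""
    return {
        (i, j): rng.uniform(-1.0, 1.0)
        for i in range(rows)
        for j in range(cols)
        if rng.random() < density
    }


def encode(blocks, chosen, rng):
    """Random linear combination of the blocks whose indices are in `chosen`."""
    encoded = {}
    for idx in chosen:
        coeff = rng.uniform(1.0, 2.0)  # assumed coefficient range, always nonzero
        for pos, val in blocks[idx].items():
            encoded[pos] = encoded.get(pos, 0.0) + coeff * val
    return encoded


def compare_nnz(k=10, weight=3, rows=200, cols=200, density=0.02, seed=0):
    """Count stored entries for a weight-`weight` and a weight-k combination."""
    rng = random.Random(seed)
    blocks = [random_sparse_block(rows, cols, density, rng) for _ in range(k)]
    dense = encode(blocks, range(k), rng)  # combines every submatrix
    sparse = encode(blocks, rng.sample(range(k), weight), rng)  # combines a few
    avg_input = sum(len(b) for b in blocks) / k
    return avg_input, len(sparse), len(dense)


avg_nnz, sparse_nnz, dense_nnz = compare_nnz()
print(f"average non-zeros per input submatrix: {avg_nnz:.0f}")
print(f"non-zeros, combination of 3 submatrices: {sparse_nnz}")
print(f"non-zeros, combination of all 10 submatrices: {dense_nnz}")
```

Because the support of a combination is (up to unlikely cancellations) the union of the supports of the submatrices it combines, the weight-3 encoding stores far fewer entries than the weight-10 one, and a worker multiplying by it does correspondingly less work.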