Straggler nodes are well-known bottlenecks in distributed matrix computations: they reduce computation and communication speeds. A common strategy for mitigating stragglers is to incorporate Reed-Solomon based MDS (maximum distance separable) codes into the framework, which can achieve resilience against an optimal number of stragglers. However, these codes assign dense linear combinations of submatrices to the worker nodes. When the input matrices are sparse, such approaches increase the number of non-zero entries in the encoded matrices, which in turn increases the worker computation time. In this work, we develop a distributed matrix computation approach in which the assigned encoded submatrices are random linear combinations of a small number of submatrices. In addition to being well suited to sparse input matrices, our approach continues to have optimal straggler resilience in a certain range of problem parameters. Moreover, compared to recent sparse matrix computation approaches, the search for a "good" set of random coefficients to promote numerical stability is much more computationally efficient in our method. We show that our approach can efficiently utilize partial computations done by slower worker nodes in a heterogeneous system, which can enhance the overall computation speed. Numerical experiments conducted on Amazon Web Services (AWS) demonstrate up to a 30% reduction in per-worker-node computation time and 100x faster encoding compared to the available methods.
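The sparsity argument above can be illustrated with a minimal sketch. It is not the coding scheme of this work: the helper names (`split_block_columns`, `encode`), the matrix size, the 2% density, the choice of `k = 8` submatrices, and the weight of 2 are all assumptions made for illustration. It only compares how many non-zero entries a worker receives when its encoded submatrix combines all `k` submatrices (as a dense MDS-style code would) versus only a few of them.

```python
import numpy as np

def split_block_columns(A, k):
    """Split A into k equal-width block-columns (hypothetical helper)."""
    return np.split(A, k, axis=1)

def encode(blocks, support, rng):
    """Random linear combination of the blocks indexed by `support` (hypothetical helper)."""
    support = list(support)
    coeffs = rng.standard_normal(len(support))
    return sum(c * blocks[i] for c, i in zip(coeffs, support))

rng = np.random.default_rng(0)

# Assumed toy parameters: k submatrices, sparse encoding combines only `weight` of them.
k, weight = 8, 2

# Sparse input: a Gaussian matrix masked so that roughly 2% of entries are non-zero.
A = rng.standard_normal((200, 160)) * (rng.random((200, 160)) < 0.02)
blocks = split_block_columns(A, k)

# Dense encoding: every submatrix contributes, so the supports of all k blocks merge.
dense_coded = encode(blocks, range(k), rng)

# Sparse encoding: only the first `weight` submatrices contribute.
sparse_coded = encode(blocks, range(weight), rng)

nnz_block = np.count_nonzero(blocks[0])
nnz_dense = np.count_nonzero(dense_coded)
nnz_sparse = np.count_nonzero(sparse_coded)

print("non-zeros in one uncoded submatrix:", nnz_block)
print("non-zeros in dense-coded submatrix:", nnz_dense)
print("non-zeros in sparse-coded submatrix:", nnz_sparse)
```

Because the work a node does on a sparse matrix grows with its number of non-zero entries, the lower count for the sparse combination is what motivates limiting how many submatrices are combined. The sketch does not cover decoding, straggler resilience, or the selection of numerically stable coefficients.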