Matrix computations are a fundamental building block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations allocate coded combinations of submatrices to worker nodes in order to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches compromise the sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we derive a lower bound on the weight of the coding, i.e., the number of submatrices combined to obtain each coded submatrix, required to provide resilience to the maximum possible number of straggler devices (for a given number of devices and their storage constraints). Next, we propose distributed matrix computation schemes that meet this lower bound on the coding weight exactly. Numerical experiments conducted on Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
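To make the coding idea concrete, the following is a minimal sketch of straggler-resilient coded matrix-vector multiplication using a dense (full-weight) random generator matrix; it illustrates the densification that motivates low-weight schemes. All parameter values and names (n, s, k, G, blocks, coded) are illustrative assumptions, not the schemes proposed in this paper, which instead use minimum-weight combinations to limit exactly this loss of sparsity.

```python
import numpy as np
from scipy import sparse

# Toy setup: n workers, tolerance to s stragglers, so results from any
# k = n - s workers must suffice to recover A @ x.  All values here are
# illustrative, not taken from the paper.
n, s = 5, 2
k = n - s                      # number of submatrices A is split into

A = sparse.random(600, 400, density=0.01, format="csr", random_state=0)
x = np.random.default_rng(0).standard_normal(400)

# Split A row-wise into k equal submatrices.
rows = A.shape[0] // k
blocks = [A[i * rows:(i + 1) * rows] for i in range(k)]

# Dense MDS-style coding: each coded submatrix combines all k blocks,
# i.e., the coding weight is k.  Any k rows of a random Gaussian G are
# invertible with high probability, giving the straggler resilience.
G = np.random.default_rng(1).standard_normal((n, k))
coded = [sum(G[w, j] * blocks[j] for j in range(k)) for w in range(n)]

# The price of full-weight coding: coded submatrices are roughly k times
# denser than the blocks of A, eroding sparse-computation savings.
print("density of an A block:   ", blocks[0].nnz / np.prod(blocks[0].shape))
print("density of a coded block:", coded[0].nnz / np.prod(coded[0].shape))

# Worker w computes coded[w] @ x; the master decodes from any k results
# by inverting the corresponding k x k submatrix of G.
survivors = [0, 2, 4]          # pretend workers 1 and 3 straggled
results = np.stack([coded[w] @ x for w in survivors])
decoded = np.linalg.inv(G[survivors, :]) @ results
assert np.allclose(np.concatenate(decoded), A @ x)
```

In this sketch the coded blocks are about k times denser than the original blocks, which is the efficiency loss that a minimum-weight coding scheme is designed to avoid while still tolerating s stragglers.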