Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in the coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers, as they cannot utilize the incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each problem into an equivalent but much simpler discrete problem with N ≪ L variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of the expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions, both with computational complexity O(N). For the resulting maximization of the completion probability, we develop an iterative algorithm of...
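To make the block partition concrete, the following is a minimal sketch of how the block sizes could determine the overall runtime. It rests on an illustrative model that the abstract does not spell out: block s tolerates s stragglers, each coordinate in block s costs a worker s + 1 units of work, every worker processes the blocks in order of increasing redundancy, and a worker's time is proportional to its load. The function name `block_runtime` and the per-unit worker times are hypothetical.

```python
import numpy as np

def block_runtime(x, worker_times):
    """Overall runtime for block sizes x under the assumed model.

    x[s]            : number of coordinates in the block that tolerates s stragglers
    worker_times[n] : time worker n needs per unit of work (hypothetical input)
    """
    x = np.asarray(x, dtype=float)
    t_sorted = np.sort(np.asarray(worker_times, dtype=float))  # fastest worker first
    n_workers = len(t_sorted)
    # Assumption: a coordinate with redundancy s costs (s + 1) units of work per worker.
    load_per_block = (np.arange(n_workers) + 1) * x
    # Work a worker has done by the time it finishes block s.
    cum_load = np.cumsum(load_per_block)
    # Block s is recoverable once the (N - s) fastest workers have finished it,
    # so the (N - s)-th fastest worker determines its completion time.
    deciding_worker_time = t_sorted[::-1]
    finish = cum_load * deciding_worker_time
    # The master is done when the last block becomes recoverable.
    return float(np.max(finish))

# Four workers, one of them ten times slower; L = 12 coordinates.
times = [1.0, 1.0, 1.0, 10.0]
print(block_runtime([12, 0, 0, 0], times))  # no redundancy
print(block_runtime([0, 12, 0, 0], times))  # identical redundancy for all coordinates
print(block_runtime([2, 10, 0, 0], times))  # two blocks with different redundancy
```

Under these assumptions the mixed partition finishes earlier than either uniform choice, because the slow worker's partial results on the first block are still used. This is the kind of trade-off the optimization over the N block sizes is meant to exploit.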
