Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning: Technical Report

Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in the coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers, as they cannot utilize the incomplete computation results that partial stragglers produce. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generalizes most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each problem into an equivalent but much simpler discrete problem with N ≪ L variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of the expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions, both with computational complexity O(N). For the resulting maximization of the completion probability, we develop an iterative algorithm of...
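
To illustrate why a block partition with different redundancies can help against partial stragglers, the following is a minimal sketch under a simplified model that is our assumption and not necessarily the exact formulation of the report. We assume that block n (n = 0, ..., N-1) holds x[n] coordinates and tolerates n stragglers, that each coordinate in block n costs a worker n + 1 units of work, that every worker processes the blocks in order of increasing redundancy, and that the master recovers block n as soon as the N - n fastest workers have finished it. The function names `overall_runtime` and `expected_runtime` and the shifted-exponential sampler are hypothetical and chosen only for illustration.

```python
import random


def overall_runtime(x, unit_times):
    """Runtime of one iteration under the simplified (assumed) model.

    x[n]          -- number of coordinates in block n (tolerates n stragglers)
    unit_times[i] -- time worker i needs per unit of work
    """
    n_workers = len(unit_times)
    if len(x) != n_workers:
        raise ValueError("need exactly one block size per worker")
    sorted_times = sorted(unit_times)  # sorted_times[k - 1]: k-th fastest worker
    runtime = 0.0
    work = 0.0  # cumulative work per worker up to and including block n
    for n, size in enumerate(x):
        work += (n + 1) * size
        if size > 0:
            # Block n is recovered when the (N - n)-th fastest worker finishes it.
            runtime = max(runtime, sorted_times[n_workers - n - 1] * work)
    return runtime


def expected_runtime(x, sample_unit_times, trials=2000, seed=0):
    """Monte Carlo estimate of the expected overall runtime."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += overall_runtime(x, sample_unit_times(rng))
    return total / trials


# One slow worker out of N = 4, with L = 12 coordinates.
times = [1.0, 1.0, 1.0, 10.0]
no_redundancy = overall_runtime([12, 0, 0, 0], times)      # waits for the straggler
uniform_redundancy = overall_runtime([0, 12, 0, 0], times)  # ignores the straggler
mixed_blocks = overall_runtime([2, 10, 0, 0], times)        # uses its partial work


def shifted_exponential(rng):
    """Hypothetical straggler model: unit time = 1 + Exp(1) per worker."""
    return [1.0 + rng.expovariate(1.0) for _ in range(4)]


estimate = expected_runtime([2, 10, 0, 0], shifted_exponential)
```

In this toy instance the mixed partition finishes sooner than either uniform choice, because the slow worker still contributes the two low-redundancy coordinates it manages to complete. Choosing the block sizes to minimize `expected_runtime` is, in spirit, the discrete problem over N variables described above; the sketch only evaluates a given partition and does not implement the optimization algorithms of the report.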
