Communication-Efficient Approximate Gradient Coding

From arXiv. Submitted to IEEE Transactions on Information Theory. This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Ann Arbor, MI, USA, 2025.

Large-scale distributed learning aims to minimize a loss function $L$, which depends on a training dataset, with respect to a parameter vector of length $d$. The distributed cluster typically consists of a parameter server (PS) and multiple workers. Gradient coding is a technique that makes the learning process resilient to straggling workers. It introduces redundancy in the assignment of data points to the workers and uses coding-theoretic ideas so that the PS can recover $\nabla L$ exactly or approximately, even in the presence of stragglers. Communication-efficient gradient coding allows the workers to communicate vectors of length smaller than $d$ to the PS, thus reducing the communication time. While existing schemes address the exact recovery of $\nabla L$ within communication-efficient gradient coding, to the best of our knowledge the approximate variant has not been considered in a systematic manner. In this work, we present constructions of communication-efficient approximate gradient coding schemes. Our schemes use structured matrices that arise from bipartite graphs, combinatorial designs, and strongly regular graphs, along with randomization and algebraic constraints. We derive analytical upper bounds on the approximation error of our schemes that are tight in certain cases. Moreover, we derive a corresponding worst-case lower bound on the approximation error of any scheme. For a large class of our methods, under reasonable probabilistic worker failure models, we show that the expected value of the computed gradient equals the true gradient. This in turn allows us to prove that the learning algorithm converges to a stationary point over the iterations. Numerical experiments corroborate our theoretical findings.
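
The unbiasedness property stated in the abstract can be illustrated on a toy example. The sketch below is not one of the paper's constructions and does not reduce communication (each worker still sends a full $d$-length vector). It assumes a simple cyclic assignment in which every data chunk is replicated on the same number of workers, an i.i.d. straggler model with probability `p`, and a PS that rescales the sum of the received messages. All names, the toy gradients, and the value of `p` are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product


def cyclic_assignment(n_workers, replication):
    """Worker i holds chunks i, i+1, ..., i+replication-1 (mod n_workers)."""
    return [[(i + k) % n_workers for k in range(replication)]
            for i in range(n_workers)]


def worker_message(chunks, partial_grads):
    """Sum the partial gradients of the chunks assigned to one worker."""
    d = len(partial_grads[0])
    return [sum(partial_grads[j][t] for j in chunks) for t in range(d)]


def ps_estimate(messages, alive, replication, p):
    """Rescaled sum of the messages from non-straggling workers."""
    d = len(messages[0])
    scale = 1 / (replication * (1 - p))
    return [scale * sum(m[t] for m, a in zip(messages, alive) if a)
            for t in range(d)]


def expected_estimate(messages, replication, p):
    """Exact expectation over all straggler patterns (i.i.d. with prob. p)."""
    n, d = len(messages), len(messages[0])
    total = [Fraction(0)] * d
    for alive in product([False, True], repeat=n):
        prob = Fraction(1)
        for a in alive:
            prob *= (1 - p) if a else p
        est = ps_estimate(messages, alive, replication, p)
        total = [acc + prob * e for acc, e in zip(total, est)]
    return total


# Toy data (hypothetical): 4 chunks, partial gradients of length d = 3.
partial_grads = [[1, 0, 2], [0, 3, -1], [4, 1, 0], [-2, 2, 5]]
true_grad = [sum(g[t] for g in partial_grads) for t in range(3)]

p = Fraction(1, 4)  # assumed straggling probability
replication = 2
assignment = cyclic_assignment(len(partial_grads), replication)
messages = [worker_message(c, partial_grads) for c in assignment]

# One round in which worker 1 straggles: the estimate is only approximate.
one_round = ps_estimate(messages, [True, False, True, True], replication, p)
# Averaged over all straggler patterns, the estimate matches the true gradient.
mean = expected_estimate(messages, replication, p)
```

In this toy setting a single round generally returns a gradient that differs from `true_grad`, while `mean` equals it exactly. The equal replication of every chunk is what makes the single rescaling factor `1 / (replication * (1 - p))` sufficient here.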
