In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. In the GC scheme, coding is performed only across computing nodes, which results in a solution where $T=0$. On the other hand, having $T>0$ allows for designing schemes that exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition to the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves a significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the second scheme can yield a 16\% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.
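For readers unfamiliar with the baseline, the following is a minimal sketch of plain GC (the $T=0$ case), not of the two schemes proposed here. It assumes the commonly used illustrative code with three workers tolerating one straggler; the encoding matrix `B`, the helper names, and the toy gradient values are assumptions chosen for illustration only.

```python
from fractions import Fraction
from itertools import combinations

# Illustrative encoding matrix (assumed): 3 workers, tolerates 1 straggler.
# Row i holds the coefficients worker i applies to partial gradients g1, g2, g3.
B = [
    [Fraction(1, 2), Fraction(1), Fraction(0)],   # worker 0: g1/2 + g2
    [Fraction(0), Fraction(1), Fraction(-1)],     # worker 1: g2 - g3
    [Fraction(1, 2), Fraction(0), Fraction(1)],   # worker 2: g1/2 + g3
]


def encode(worker, partial_grads):
    """Return the coded vector that `worker` sends to the master."""
    dim = len(partial_grads[0])
    return [
        sum(B[worker][k] * partial_grads[k][d] for k in range(3))
        for d in range(dim)
    ]


def decoding_coeffs(i, j):
    """Find (a, b) such that a*B[i] + b*B[j] equals the all-ones row."""
    for c1, c2 in combinations(range(3), 2):
        det = B[i][c1] * B[j][c2] - B[i][c2] * B[j][c1]
        if det == 0:
            continue
        # Cramer's rule on columns c1 and c2, then verify all three columns.
        a = (B[j][c2] - B[j][c1]) / det
        b = (B[i][c1] - B[i][c2]) / det
        if all(a * B[i][c] + b * B[j][c] == 1 for c in range(3)):
            return a, b
    raise ValueError("these two workers cannot recover the full gradient")


def decode(results):
    """Recover g1 + g2 + g3 from the coded vectors of any two workers.

    `results` maps worker index -> coded vector; stragglers are simply absent.
    """
    (i, vi), (j, vj) = sorted(results.items())[:2]
    a, b = decoding_coeffs(i, j)
    return [a * x + b * y for x, y in zip(vi, vj)]


if __name__ == "__main__":
    grads = [[1, 2], [3, 4], [5, 6]]  # toy partial gradients (assumed values)
    for straggler in range(3):
        received = {w: encode(w, grads) for w in range(3) if w != straggler}
        print(straggler, decode(received))
```

Each worker computes two of the three partial gradients, and the master recovers the full sum from whichever two workers respond first. All coding here is across workers within a single round; none of it uses the temporal dimension that the schemes in this paper exploit.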