In distributed learning, workers typically compute gradients on their assigned dataset chunks and send them to the parameter server (PS), which aggregates them to compute either an exact or approximate version of $\nabla L$ (the gradient of the loss function $L$). However, in large-scale clusters, many workers run slower than their promised speed or are even failure-prone. A gradient coding solution introduces redundancy in the assignment of chunks to the workers and uses coding-theoretic ideas to allow the PS to recover $\nabla L$ (exactly or approximately), even in the presence of stragglers. Unfortunately, most existing gradient coding protocols are computationally inefficient because they coarsely classify workers as either operational or failed; the potentially valuable work performed by slow workers (partial stragglers) is ignored. In this work, we present novel gradient coding protocols that judiciously leverage the work performed by partial stragglers. Our protocols are efficient in both computation and communication, and they are numerically stable. For an important class of chunk assignments, we present efficient algorithms for optimizing the relative ordering of chunks within the workers; this ordering affects the overall execution time. For exact gradient reconstruction, our protocol is around $2\times$ faster than the original class of protocols, and for approximate gradient reconstruction, the mean-squared error of our reconstructed gradient is several orders of magnitude better.
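To make the basic idea of gradient coding concrete, here is a minimal sketch of a toy three-worker scheme that tolerates one straggler. It is not the protocol proposed in this work: the encoding matrix `B`, the `DECODE` table, and the helper names `encode` and `decode` are illustrative assumptions. Each worker stores two of the three chunks and sends a single coded vector, and the PS recovers the sum of the chunk gradients from any two workers.

```python
# Toy gradient coding sketch (illustrative assumption, not this paper's protocol).
# Three chunks with gradients g0, g1, g2; each worker stores two chunks.

# Row i holds the coefficients worker i applies to (g0, g1, g2).
B = [
    [0.5, 1.0, 0.0],   # worker 0 sends 0.5*g0 + g1
    [0.0, 1.0, -1.0],  # worker 1 sends g1 - g2
    [0.5, 0.0, 1.0],   # worker 2 sends 0.5*g0 + g2
]

# Decoding coefficients for every pair of workers that responded.
# Each pair's combination of rows of B equals the all-ones vector.
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}


def encode(worker, chunk_grads):
    """Return the coded vector that `worker` sends to the PS."""
    dim = len(chunk_grads[0])
    return [
        sum(B[worker][j] * chunk_grads[j][k] for j in range(3))
        for k in range(dim)
    ]


def decode(received):
    """Recover g0 + g1 + g2 from the coded vectors of exactly two workers.

    `received` maps a worker index to the coded vector it sent.
    """
    pair = tuple(sorted(received))
    a, b = DECODE[pair]
    u, v = received[pair[0]], received[pair[1]]
    return [a * x + b * y for x, y in zip(u, v)]


# Hypothetical chunk gradients, chosen only for illustration.
chunk_grads = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
full_gradient = [sum(g[k] for g in chunk_grads) for k in range(2)]

# Worker 2 is treated as a straggler; workers 0 and 1 suffice.
received = {w: encode(w, chunk_grads) for w in (0, 1)}
recovered = decode(received)
```

Note that this toy scheme discards everything a straggler has computed, even if that worker has already finished one of its two chunks. That is the all-or-nothing treatment of workers that the abstract identifies as wasteful.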