This paper considers the problem of distributed learning (DL) in the presence of stragglers. For this problem, DL methods based on gradient coding have been widely investigated; these methods distribute the training data redundantly across the workers to guarantee convergence when some workers are stragglers. However, they require the workers to transmit real-valued vectors during learning, which imposes a very high communication burden. To overcome this drawback, we propose a novel DL method based on 1-bit gradient coding (1-bit GC-DL), in which the workers transmit 1-bit data encoded from their locally computed gradients to reduce the communication overhead. We establish convergence guarantees for the proposed method for both convex and nonconvex loss functions. Experiments show that 1-bit GC-DL outperforms the baseline methods, attaining better learning performance under the same communication overhead.
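To make the idea of redundant data allocation combined with 1-bit transmission concrete, the following is a minimal sketch and not the paper's actual scheme. It assumes sign-based encoding (one bit per gradient coordinate), a cyclic allocation of data partitions, and majority-vote aggregation at the server; the abstract specifies none of these choices. All function names and parameters (`sign_encode`, `cyclic_allocation`, `straggler_prob`, and so on) are hypothetical.

```python
import random

def sign_encode(grad):
    """Encode a real-valued gradient as one bit per coordinate (+1 or -1).
    Assumption: sign quantization stands in for the paper's 1-bit encoder."""
    return [1 if g >= 0 else -1 for g in grad]

def cyclic_allocation(num_workers, redundancy):
    """Give each worker `redundancy` consecutive data partitions, cyclically,
    so every partition is held by `redundancy` different workers."""
    return [[(w + j) % num_workers for j in range(redundancy)]
            for w in range(num_workers)]

def worker_message(partition_grads, assigned):
    """Sum the gradients of the assigned partitions, then keep only the signs."""
    dim = len(partition_grads[0])
    local = [sum(partition_grads[p][k] for p in assigned) for k in range(dim)]
    return sign_encode(local)

def aggregate(messages):
    """Majority vote over the 1-bit messages (ties resolve to +1)."""
    dim = len(messages[0])
    return [1 if sum(m[k] for m in messages) >= 0 else -1 for k in range(dim)]

def step(params, partition_grads, allocation, straggler_prob, lr, rng):
    """One update: each worker straggles independently with probability
    `straggler_prob`; the remaining workers send sign vectors."""
    messages = [worker_message(partition_grads, assigned)
                for assigned in allocation
                if rng.random() >= straggler_prob]
    if not messages:  # every worker straggled, so skip this update
        return params
    direction = aggregate(messages)
    return [p - lr * d for p, d in zip(params, direction)]
```

In this sketch each worker sends one bit per coordinate instead of a floating-point value, and the redundancy in `cyclic_allocation` means each partition still influences the update when some of the workers holding it straggle.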