Distributed learning is commonly used to accelerate model training by harnessing the computational capabilities of multiple edge devices. In practice, however, communication delay becomes a bottleneck because of the substantial information exchange required between the workers and a central parameter server. SignSGD with majority voting (signSGD-MV) is an effective distributed learning algorithm that significantly reduces communication costs through one-bit quantization. However, when heterogeneous computational capabilities lead workers to use different mini-batch sizes, it fails to converge. To overcome this, we propose a novel signSGD optimizer with \textit{federated voting} (signSGD-FV). The idea of federated voting is to exploit learnable weights to perform weighted majority voting. The server learns the weights assigned to the edge devices in an online fashion based on their computational capabilities. These weights are then used to decode the signs of the aggregated local gradients in a way that minimizes the sign decoding error probability. We provide a unified convergence rate analysis framework that applies whether the parameter server knows the estimated weights perfectly or imperfectly. We demonstrate that the proposed signSGD-FV algorithm has a theoretical convergence guarantee even when edge devices use heterogeneous mini-batch sizes. Experimental results show that signSGD-FV outperforms signSGD-MV, converging faster, especially when mini-batch sizes are heterogeneous.
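As a rough illustration of the difference between plain and weighted majority voting on one-bit gradients, consider the minimal Python sketch below. It is not the signSGD-FV implementation: the log-odds weight $\log((1-p_m)/p_m)$ is assumed here because it is the standard maximum-likelihood rule for independent binary votes with error probability $p_m$, and the online estimate of $p_m$ (counting how often a worker's sign disagrees with the decoded sign) is a hypothetical stand-in for the weight-learning step. All function names are illustrative.

```python
import math


def sign(x):
    # Ties (x == 0) are mapped to +1; this is an arbitrary convention.
    return 1 if x >= 0 else -1


def llr_weight(p_err, eps=1e-6):
    # Assumed log-odds weight for a worker whose signs are wrong with
    # probability p_err; clipped so the weight stays finite.
    p = min(max(p_err, eps), 1.0 - eps)
    return math.log((1.0 - p) / p)


def weighted_vote(signs, weights):
    # signs: one list of +1/-1 entries per worker, all of equal length.
    # Returns the sign of the weighted sum for each coordinate.
    dim = len(signs[0])
    return [
        sign(sum(w * s[i] for s, w in zip(signs, weights)))
        for i in range(dim)
    ]


def majority_vote(signs):
    # Plain majority voting is the special case of equal weights.
    return weighted_vote(signs, [1.0] * len(signs))


def update_error_estimates(signs, decoded, mismatches, totals):
    # Hypothetical online estimator: accumulate, per worker, how often its
    # sign disagrees with the decoded sign, and return the running rates.
    for m, s in enumerate(signs):
        mismatches[m] += sum(1 for a, b in zip(s, decoded) if a != b)
        totals[m] += len(decoded)
    return [mismatches[m] / totals[m] for m in range(len(signs))]


# One reliable worker (for example, a large mini-batch) against two noisy ones.
signs = [[1, -1], [-1, 1], [-1, 1]]
weights = [llr_weight(p) for p in (0.05, 0.45, 0.45)]
print(majority_vote(signs))           # the two noisy workers win
print(weighted_vote(signs, weights))  # the reliable worker wins
```

In this toy case the reliable worker's weight (about 2.94) exceeds the combined weight of the two noisy workers (about 0.40), so the weighted vote follows the reliable worker while the unweighted vote does not.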