On the Optimal Batch Size for Byzantine-Robust Distributed Learning

Byzantine-robust distributed learning (BRDL), in which computing devices may behave abnormally due to accidental failures or malicious attacks, has recently become a hot research topic. However, even in the independent and identically distributed (i.i.d.) case, existing BRDL methods suffer a significant drop in model accuracy due to the large variance of stochastic gradients. Increasing the batch size is a simple yet effective way to reduce this variance. However, when the total number of gradient computations is fixed, an overly large batch size leads to too few iterations (model updates), which may also degrade model accuracy. In view of this challenge, this work studies the optimal batch size under a fixed total number of gradient computations. In particular, we show theoretically and empirically that, under this fixed budget, the optimal batch size in BRDL increases with the fraction of Byzantine workers. Therefore, the batch size should be set larger under Byzantine attacks than in the attack-free case. However, for existing BRDL methods, large batch sizes lead to a drop in model accuracy even when there is no Byzantine attack. To deal with this problem, we propose a novel BRDL method, called Byzantine-robust stochastic gradient descent with normalized momentum (ByzSGDnm), which alleviates the drop in model accuracy in large-batch cases. Moreover, we prove the convergence of ByzSGDnm for general non-convex cases under Byzantine attacks. Empirical results show that ByzSGDnm performs comparably to existing BRDL methods under bit-flipping failure and can outperform them under deliberately crafted attacks.
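
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each worker computes one mini-batch per iteration, so a fixed gradient budget is split as batch size times number of workers times number of iterations. It also assumes coordinate-wise median as the robust aggregator and a standard momentum form; the abstract specifies neither, and all function names and parameters (`iterations_for_budget`, `local_momentum`, `server_step`, `lr`, `beta`) are hypothetical.

```python
import numpy as np


def iterations_for_budget(total_grads, batch_size, num_workers):
    """Number of model updates affordable under a fixed gradient budget.

    Assumption: every worker computes `batch_size` gradients per iteration.
    """
    return total_grads // (batch_size * num_workers)


def local_momentum(prev_momentum, grad, beta=0.9):
    """Worker-side momentum update (an assumed, standard form)."""
    return beta * prev_momentum + (1.0 - beta) * grad


def server_step(w, received, lr=0.1, eps=1e-12):
    """One normalized update from the vectors the server received.

    Assumption: coordinate-wise median stands in for the robust aggregator.
    The aggregate is normalized, so each step has length about `lr`
    regardless of how large the received vectors are.
    """
    agg = np.median(np.stack(received), axis=0)
    return w - lr * agg / (np.linalg.norm(agg) + eps)


if __name__ == "__main__":
    # Same budget, larger batch: fewer updates.
    print(iterations_for_budget(160_000, 32, 8))
    print(iterations_for_budget(160_000, 256, 8))

    # Three honest workers and one Byzantine worker sending a huge vector.
    honest = [np.array([1.0, 0.0])] * 3
    byzantine = [np.array([1e6, -1e6])]
    print(server_step(np.zeros(2), honest + byzantine))
```

The first function shows the trade-off the abstract describes: with the budget held fixed, increasing the batch size by a factor of eight cuts the number of updates by roughly the same factor. The normalization in `server_step` bounds the size of each update, which is the property the method's name refers to; the paper's actual update rule and aggregator may differ from this sketch.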
