Modern deep neural networks are often so large that they must be trained in a distributed fashion across many workers. As the number of workers grows, communication overhead becomes the main bottleneck in data-parallel minibatch stochastic gradient methods that synchronize gradients at every iteration. Local gradient methods such as Local SGD reduce communication by synchronizing only after several local steps. Although their convergence is well understood in both i.i.d. and heterogeneous settings, and batch size is known to matter for efficiency and generalization, optimal local batch sizes remain difficult to determine. We introduce adaptive batch size strategies for local gradient methods that increase batch sizes to reduce minibatch gradient variance. We provide convergence guarantees under homogeneous data conditions and support them with image classification experiments that demonstrate the effectiveness of our strategies in both training and generalization.
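To make the mechanism concrete, the following is a minimal sketch of Local SGD with a growing local batch size on a toy least-squares problem. It is an illustration under stated assumptions, not the strategy analyzed in this work: the batch-size rule is a norm-test-style heuristic swapped in for illustration, and the names `per_sample_grads`, `next_batch_size`, `local_sgd`, the threshold `eta`, and all default values are hypothetical. Workers are simulated sequentially and draw from the same dataset, which corresponds to the homogeneous data setting.

```python
import numpy as np


def per_sample_grads(w, X, y):
    """Per-sample gradients of the squared loss 0.5 * (x @ w - y)**2."""
    residual = X @ w - y              # shape (b,)
    return residual[:, None] * X      # shape (b, d)


def next_batch_size(grads, b, eta, b_max):
    """Norm-test-style rule (an assumption, not the source's exact test).

    Grow the batch when the estimated variance of the minibatch gradient
    exceeds eta**2 times its squared norm. The batch size never shrinks
    and is capped at b_max. Requires at least 2 samples in `grads`.
    """
    g = grads.mean(axis=0)
    var = grads.var(axis=0, ddof=1).sum()   # summed per-coordinate variance
    g_norm_sq = float(g @ g)
    if var / b > eta**2 * g_norm_sq:
        proposed = int(np.ceil(var / (eta**2 * g_norm_sq + 1e-12)))
        b = max(b, proposed)
    return min(b, b_max)


def local_sgd(X, y, num_workers=4, rounds=20, local_steps=8,
              lr=0.05, b0=4, eta=0.5, seed=0):
    """Simulate Local SGD: local steps per worker, then parameter averaging."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_global = np.zeros(d)
    b = b0
    history = []                      # batch size chosen after each round
    for _ in range(rounds):
        local_models, proposals = [], []
        for _ in range(num_workers):
            w = w_global.copy()       # every worker starts from the synced model
            for _ in range(local_steps):
                idx = rng.choice(n, size=b, replace=False)
                grads = per_sample_grads(w, X[idx], y[idx])
                w -= lr * grads.mean(axis=0)
            local_models.append(w)
            # Each worker proposes a batch size from its last local minibatch.
            proposals.append(next_batch_size(grads, b, eta, b_max=n))
        # Communication happens only here, once per round.
        w_global = np.mean(local_models, axis=0)
        b = max(proposals)            # all workers adopt the largest proposal
        history.append(b)
    return w_global, history
```

Calling `local_sgd(X, y)` on a synthetic regression dataset returns the averaged model together with the batch size used after each communication round. Taking the maximum proposal across workers is one simple design choice that keeps local batch sizes identical; other aggregation rules are equally plausible in a sketch like this.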