Modern deep neural networks often require distributed training with many workers due to their large size. As the number of workers increases, communication overhead becomes the main bottleneck in data-parallel minibatch stochastic gradient methods, which synchronize gradients at every iteration. Local gradient methods such as Local SGD reduce communication by synchronizing model parameters and/or gradients only after several local steps. Although their convergence is well understood, and batch sizes are known to be important for both training efficiency and generalization, optimal batch sizes for local gradient methods are difficult to determine. We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance. We provide convergence guarantees under homogeneous data conditions and support our claims with image classification and language modeling experiments, demonstrating the effectiveness of our strategies for both training efficiency and generalization.
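The scheme described above can be illustrated with a minimal sketch: workers run several local SGD steps between parameter averaging rounds, and the per-worker batch size is doubled when the estimated minibatch gradient variance dominates the squared gradient norm. This is only a toy illustration on a quadratic objective with a hypothetical norm-test-style trigger (the threshold `theta`, the doubling rule, and all hyperparameters are assumptions, not the paper's exact algorithm).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(w) = 0.5 * ||w - w_star||^2, observed through noisy
# per-sample gradients (w - w_star) + Gaussian noise.
w_star = np.array([1.0, -2.0])

def stoch_grads(w, batch_size):
    """Return a batch of per-sample stochastic gradients at w."""
    noise = rng.normal(0.0, 1.0, size=(batch_size, w.size))
    return (w - w_star) + noise

num_workers = 4
local_steps = 8      # H: local updates between synchronizations
lr = 0.1
batch_size = 2       # initial per-worker batch size
theta = 1.0          # variance-test threshold (hypothetical choice)

workers = [np.zeros(2) for _ in range(num_workers)]
for _round in range(20):
    # Each worker takes H local minibatch SGD steps independently.
    for k in range(num_workers):
        w = workers[k]
        for _ in range(local_steps):
            g = stoch_grads(w, batch_size)
            w = w - lr * g.mean(axis=0)
        workers[k] = w

    # Communication step: average model parameters across workers.
    w_avg = np.mean(workers, axis=0)
    workers = [w_avg.copy() for _ in range(num_workers)]

    # Adaptive batch size: if the sample variance of the minibatch
    # gradient dominates its squared norm, double the batch size.
    g = stoch_grads(w_avg, batch_size)
    g_bar = g.mean(axis=0)
    var_est = g.var(axis=0, ddof=1).sum() / batch_size
    if var_est > theta * np.dot(g_bar, g_bar):
        batch_size *= 2

print(batch_size)
print(np.linalg.norm(w_avg - w_star))
```

As the averaged iterate approaches the optimum, the true gradient shrinks while the sampling noise does not, so the test fires more often and the batch size grows, mirroring the intuition that larger batches are needed late in training to keep reducing gradient variance.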