In distributed deep learning with data parallelism, synchronizing gradients at every training step can incur substantial communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by letting workers compute locally for $H$ steps without synchronizing with others, thereby reducing communication frequency. While $H$ has traditionally been viewed as a hyperparameter that trades optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can also improve generalization. Yet how to select a proper $H$ remains elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve test accuracy over other synchronization strategies. Compared with standard data-parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs from 26.7 to 20.2 hours or from 8.6 to 5.5 hours, respectively, while achieving $1.16\%$ or $0.84\%$ higher top-1 validation accuracy.
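For concreteness, a minimal sketch of how the rule $H \propto \frac{1}{\eta^2}$ could be operationalized in a training loop is given below. The coefficient `alpha` and the minimum period `H_base` are illustrative hyperparameters introduced only for this sketch; they are not specified in the abstract.

```python
import math

def qsr_sync_period(lr: float, alpha: float = 0.1, H_base: int = 2) -> int:
    """Return the number of local steps H before the next gradient sync.

    Illustrative instantiation of H proportional to 1 / lr^2: as the learning
    rate decays, workers synchronize less and less frequently. `alpha` and
    `H_base` are hypothetical knobs, not values from the paper.
    """
    return max(H_base, math.floor((alpha / lr) ** 2))

# Example: with a decaying learning rate schedule, H grows quadratically
# as lr shrinks, reducing communication late in training.
for lr in (0.3, 0.1, 0.03, 0.01):
    print(f"lr = {lr:<5} -> H = {qsr_sync_period(lr)}")
```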