Local SGD is a communication-efficient variant of SGD for large-scale training in which multiple GPUs run SGD independently and periodically average their model parameters. Recent work has observed that Local SGD not only achieves its design goal of reducing communication overhead but can also reach higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), although the training regimes in which this happens are still under debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better, based on a Stochastic Differential Equation (SDE) approximation. The main contributions are (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small learning rate regime, showing how noise drives the iterate to drift and diffuse once it is close to the manifold of local minima; (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term, which can result in a stronger regularization effect, e.g., a faster reduction of sharpness; and (iii) empirical evidence that a small learning rate combined with a sufficiently long training time enables the generalization improvement over SGD, whereas removing either condition leads to no improvement.
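The update rule described above (independent SGD on each worker, followed by periodic parameter averaging) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `local_sgd` and `noisy_quadratic_grad`, the toy quadratic objective, and all hyperparameter values are assumptions chosen for illustration, and the workers are simulated sequentially on one machine rather than on separate GPUs.

```python
import numpy as np

def local_sgd(grad_fn, x0, num_workers, local_steps, rounds, lr, rng):
    """Simulate Local SGD: each worker runs `local_steps` SGD steps, then all average."""
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        replicas = []
        for _ in range(num_workers):
            w = x.copy()  # every worker starts the round from the shared model
            for _ in range(local_steps):
                w -= lr * grad_fn(w, rng)  # independent stochastic gradient step
            replicas.append(w)
        x = np.mean(replicas, axis=0)  # periodic parameter averaging
    return x

def noisy_quadratic_grad(w, rng, noise_std=0.1):
    # Hypothetical toy objective 0.5 * ||w||^2 with additive Gaussian gradient noise.
    return w + noise_std * rng.standard_normal(w.shape)

rng = np.random.default_rng(0)
x_final = local_sgd(noisy_quadratic_grad, x0=np.ones(5), num_workers=4,
                    local_steps=8, rounds=50, lr=0.05, rng=rng)
print(np.linalg.norm(x_final))  # small: the iterate hovers near the minimizer at 0
```

Setting `local_steps=1` makes each round equivalent to one SGD step with the gradient averaged over all workers, which is the parallel SGD baseline that Local SGD is compared against. The toy objective has a single minimizer, so this sketch only shows the mechanics of the algorithm, not the drift along a manifold of minima or the generalization effect studied in the paper.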