Decentralized stochastic gradient descent (D-SGD) allows a massive number of devices to learn collaboratively and simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge this conventional belief and present a new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm under general non-convex, non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) D-SGD contains a free uncertainty evaluation mechanism that improves posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as the total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios. The code is available at https://github.com/Raiden-Zhu/ICML-2023-DSGD-and-SAM.
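For readers unfamiliar with D-SGD, the following is a minimal sketch of the commonly used update rule: each node averages its parameters with its neighbors through a doubly stochastic mixing matrix, then takes a local gradient step. It is not taken from the linked repository. The ring topology, the scalar quadratic local losses, and the names `ring_mixing_matrix`, `dsgd`, and `consensus_distance` are assumptions chosen for illustration. The sketch does not reproduce the SAM equivalence stated above; it only shows the update and reports how far the local models spread around their average.

```python
import random


def ring_mixing_matrix(n):
    """Doubly stochastic gossip matrix for a ring (assumed topology).

    Each node gives weight 1/3 to itself and to each of its two neighbors.
    """
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in (j - 1, j, j + 1):
            P[j][k % n] += 1.0 / 3.0
    return P


def dsgd(P, centers, lr=0.1, steps=500, noise=0.0, seed=0):
    """Run D-SGD on toy local losses 0.5 * (w - centers[j]) ** 2.

    Every node j keeps its own scalar parameter w[j]. There is no central
    server: node j only mixes with the nodes k for which P[j][k] > 0.
    """
    rng = random.Random(seed)
    n = len(centers)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(steps):
        # Local stochastic gradient, evaluated at each node's own parameter.
        grads = [w[j] - centers[j] + noise * rng.gauss(0.0, 1.0)
                 for j in range(n)]
        # Gossip averaging with neighbors, followed by the local step.
        w = [sum(P[j][k] * w[k] for k in range(n)) - lr * grads[j]
             for j in range(n)]
    return w


def consensus_distance(w):
    """Euclidean distance of the local parameters from their average."""
    mean = sum(w) / len(w)
    return sum((x - mean) ** 2 for x in w) ** 0.5


if __name__ == "__main__":
    P = ring_mixing_matrix(8)
    w = dsgd(P, centers=[float(j) for j in range(8)])
    print("average model:", sum(w) / len(w))
    print("consensus distance:", consensus_distance(w))
```

Because the mixing matrix is doubly stochastic, the average of the local parameters follows a plain gradient step on the average loss, while the individual nodes remain scattered around that average whenever their local data (here, `centers`) differ.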