Decentralized stochastic gradient descent (D-SGD) allows collaborative learning across a massive number of devices simultaneously, without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge this conventional belief and present a new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm under general non-convex, non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) D-SGD contains a free uncertainty evaluation mechanism to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as the total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.
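For readers unfamiliar with the algorithm, the following is a minimal sketch of a common form of the D-SGD update: each node mixes its parameters with those of its neighbours through a doubly stochastic gossip matrix, then takes a local stochastic gradient step. Everything specific here is an assumption made for illustration and is not taken from the paper: the ring topology, the scalar quadratic local losses $f_i(w) = \frac{1}{2}(w - c_i)^2$, the Gaussian gradient noise, and the function names `ring_mixing_matrix` and `dsgd`.

```python
import random


def ring_mixing_matrix(n):
    """Doubly stochastic gossip matrix for a ring (illustrative choice).

    Each node gives weight 1/3 to itself and to each of its two neighbours.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def dsgd(targets, steps=500, lr=0.05, noise_std=0.1, seed=0):
    """Run D-SGD on toy local losses f_i(w) = 0.5 * (w - targets[i]) ** 2.

    Returns the list of local models, one scalar per node.
    """
    rng = random.Random(seed)
    n = len(targets)
    W = ring_mixing_matrix(n)
    w = [0.0] * n  # every node starts from the same point
    for _ in range(steps):
        # Local stochastic gradient: exact gradient plus Gaussian noise (assumed).
        grads = [(w[i] - targets[i]) + rng.gauss(0.0, noise_std) for i in range(n)]
        # Gossip step: weighted average over neighbours, no central server.
        mixed = [sum(W[i][j] * w[j] for j in range(n)) for i in range(n)]
        # Local descent step applied to the mixed parameters.
        w = [mixed[i] - lr * grads[i] for i in range(n)]
    return w


if __name__ == "__main__":
    local_models = dsgd([float(c) for c in range(1, 9)])
    average = sum(local_models) / len(local_models)
    print("average model:", average)
    print("local deviations:", [x - average for x in local_models])
```

In this toy setting the averaged model settles near the minimizer of the global loss (the mean of the `targets`), while the local models stay scattered around that average because each node pulls toward its own target. That scatter of local models around their average is the quantity the equivalence with average-direction SAM is concerned with; the sketch only makes it visible and does not reproduce the paper's analysis.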