Decentralized stochastic gradient descent (D-SGD) allows collaborative learning across a massive number of devices simultaneously, without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge this conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction sharpness-aware minimization (SAM) algorithm under general non-convex, non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) D-SGD contains a free uncertainty evaluation mechanism to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as the total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.
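To make the setting concrete, the following is a minimal sketch of one common formulation of D-SGD, in which each node averages its parameters with its neighbors through a doubly stochastic mixing matrix and then takes a local gradient step. The abstract does not state the update rule, so this form is an assumption, as are the ring topology, the scalar quadratic local losses, and the hypothetical names `ring_mixing_matrix`, `d_sgd`, `average`, and `consensus_distance`. The sketch does not reproduce the paper's equivalence result; it only illustrates that, with a constant step size, the local models stay spread around their average while the average itself is optimized.

```python
import random


def ring_mixing_matrix(n):
    """Doubly stochastic gossip matrix for a ring (assumed topology):
    each node weights itself and its two neighbours by 1/3."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def d_sgd(targets, steps=200, lr=0.1, noise=0.0, seed=0):
    """Run D-SGD on toy local losses f_i(x) = 0.5 * (x - targets[i])**2.

    Each node i keeps its own scalar parameter x[i]. Per step it
    (1) evaluates a (possibly noisy) gradient at its own parameter and
    (2) combines neighbour averaging with a local gradient step.
    """
    rng = random.Random(seed)
    n = len(targets)
    W = ring_mixing_matrix(n)
    x = [0.0] * n
    for _ in range(steps):
        # Local stochastic gradients, evaluated at each node's own model.
        grads = [x[i] - targets[i] + rng.gauss(0.0, noise) for i in range(n)]
        # Gossip averaging with neighbours; no central server is involved.
        mixed = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [mixed[i] - lr * grads[i] for i in range(n)]
    return x


def average(xs):
    """Mean of the local models (the 'global averaged model')."""
    return sum(xs) / len(xs)


def consensus_distance(xs):
    """Mean squared deviation of the local models from their average."""
    mean = average(xs)
    return sum((v - mean) ** 2 for v in xs) / len(xs)


if __name__ == "__main__":
    local_models = d_sgd([0.0, 1.0, 2.0, 3.0])
    print("averaged model:", average(local_models))
    print("consensus distance:", consensus_distance(local_models))
```

In this toy run the averaged model approaches the minimizer of the mean loss (the mean of `targets`), while the consensus distance stays strictly positive because the local losses differ. That residual spread of local models around the average is the kind of perturbation the sharpness-aware interpretation is concerned with; the toy problem itself is convex and smooth, so it says nothing about the non-convex, non-$\beta$-smooth regime the abstract addresses.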