We study centralized distributed data-parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple yet effective sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization than local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss-landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between the push and pull strengths, and that its pull-push dynamics are self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
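To make the pull-push mechanism concrete, the sketch below simulates a few workers on a toy quadratic loss: each worker takes local gradient steps, then a consensus step pulls it toward the worker average while a repulsive term pushes it away, and the residual spread of the workers serves as a rough proxy for valley width. This is only an illustrative sketch under stated assumptions; the names (`dppf_sketch`, `grad_loss`), the toy loss, and the exact form of the push term are placeholders, since the paper's actual DPPF update rule and the Inverse Mean Valley relaxation are not given in this section.

```python
import numpy as np

def grad_loss(theta, rng):
    # Noisy gradient of a simple quadratic loss centered at the origin
    # (a stand-in for a DNN mini-batch gradient; not the paper's objective).
    return theta + 0.1 * rng.standard_normal(theta.shape)

def dppf_sketch(num_workers=4, dim=2, local_steps=8, rounds=50,
                lr=0.05, pull=0.5, push=0.1, seed=0):
    """Illustrative pull-push loop: local gradient steps, then a consensus
    (pull) step toward the worker average opposed by a repulsive (push)
    step away from it. The push/pull forms here are assumptions."""
    rng = np.random.default_rng(seed)
    workers = [rng.standard_normal(dim) for _ in range(num_workers)]
    for _ in range(rounds):
        # Communication-free phase: each worker runs local gradient steps.
        for k in range(num_workers):
            for _ in range(local_steps):
                workers[k] = workers[k] - lr * grad_loss(workers[k], rng)
        # Communication round: pull toward the average, push away from it.
        mean = np.mean(workers, axis=0)
        for k in range(num_workers):
            direction = workers[k] - mean
            workers[k] = workers[k] - pull * direction + push * direction
    return workers, np.mean(workers, axis=0)

if __name__ == "__main__":
    workers, center = dppf_sketch()
    spread = max(np.linalg.norm(w - center) for w in workers)
    print(f"consensus point: {center}, worker spread (valley-width proxy): {spread:.3f}")
```

With pull strength larger than push strength the net consensus force is contractive, so the workers settle at a finite spread rather than diverging, which loosely mirrors the self-stabilizing interplay of push and pull strengths described in the abstract.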