We study centralized, distributed data-parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple yet effective sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and generalizes better than both local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics are self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
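The interplay between the pull and push forces can be illustrated with a toy consensus round. The sketch below is not the DPPF update from the paper; it rests on assumptions that the abstract does not confirm: the pull is taken to be a step of size `alpha` toward the worker average, the push is taken to be a step of fixed length `lam` along the unit direction away from that average, and local gradient steps are omitted entirely. The names `pull_push_step`, `alpha`, and `lam` are hypothetical.

```python
import numpy as np


def pull_push_step(workers, alpha, lam, eps=1e-12):
    """One hypothetical consensus round (an assumption, not the paper's rule).

    workers: array of shape (m, d), one parameter vector per worker.
    alpha:   pull strength toward the worker average.
    lam:     push strength away from the worker average.
    """
    avg = workers.mean(axis=0)
    diff = workers - avg                                  # offset of each worker from the average
    dist = np.linalg.norm(diff, axis=1, keepdims=True)    # distance of each worker to the average
    pull = -alpha * diff                                  # shrinks the offset by a factor (1 - alpha)
    push = lam * diff / np.maximum(dist, eps)             # fixed-length step along the offset direction
    return workers + pull + push


if __name__ == "__main__":
    # Two workers, so the unit push directions cancel and the average stays fixed.
    w = np.array([[0.0, 0.0], [3.0, 4.0]])
    for _ in range(60):
        w = pull_push_step(w, alpha=0.5, lam=1.0)
    # Each distance r follows r <- (1 - alpha) * r + lam, with fixed point lam / alpha = 2.0.
    print(np.linalg.norm(w - w.mean(axis=0), axis=1))
```

Under these assumptions and with two workers, the distance of each worker to the average follows the recursion r ← (1 − alpha)·r + lam, which contracts to lam / alpha for 0 < alpha < 1. This toy model mirrors the abstract's claims that the final valley width is governed by the ratio of push to pull strength and that the dynamics are self-stabilizing. With more than two workers the push directions need not cancel, so the toy recursion is only approximate.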