We study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards a center variable computed as a weighted average of the workers, where the weights are inversely proportional to the workers' gradient norms, so that recovering flat regions of the optimization landscape is prioritized. We develop two asynchronous variants of the proposed algorithm, Model-level and Layer-level Gradient-based Weighted Averaging (MGRAWA and LGRAWA, respectively), which differ in whether the weighting is computed with respect to the entire model or applied layer-wise. On the theoretical front, we prove convergence guarantees for the proposed approach in both convex and non-convex settings. We then demonstrate experimentally that our algorithms outperform competing methods by converging faster and recovering better-quality, flatter local optima. We also carry out an ablation study to analyze the scalability of the proposed algorithms in more crowded distributed training environments. Finally, we report that our approach requires less frequent communication and fewer distributed updates than state-of-the-art baselines.
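The weighting idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact update rule, so the function names, the pulling strength `lam`, the `eps` safeguard, and the synchronous, single-process setting are all assumptions made for illustration. The sketch only shows weights inversely proportional to gradient norms, a center formed as the weighted average of the workers, and a pull of each worker toward that center, in a model-level and a layer-wise form.

```python
import numpy as np

def grawa_weights(grad_norms, eps=1e-12):
    """Weights inversely proportional to gradient norms, normalized to sum to 1.

    eps is a hypothetical safeguard against division by zero.
    """
    inv = 1.0 / (np.asarray(grad_norms, dtype=float) + eps)
    return inv / inv.sum()

def center_variable(workers, grad_norms):
    """Weighted average of worker parameter arrays (one array per worker)."""
    w = grawa_weights(grad_norms)
    # Contract the worker axis: sum_i w[i] * workers[i]
    return np.tensordot(w, np.stack(workers), axes=1)

def layerwise_center(workers_layers, layer_grad_norms):
    """Layer-wise variant: a separate set of weights is computed for each layer.

    workers_layers[i][l] holds layer l of worker i;
    layer_grad_norms[i][l] holds the gradient norm of that layer.
    """
    n_layers = len(workers_layers[0])
    return [
        center_variable(
            [wk[l] for wk in workers_layers],
            [g[l] for g in layer_grad_norms],
        )
        for l in range(n_layers)
    ]

def pull_toward_center(x, center, lam=0.5):
    """Move a worker a fraction lam of the way toward the center (lam is assumed)."""
    return (1.0 - lam) * x + lam * center
```

In this sketch a worker with a smaller gradient norm receives a larger weight, so the center is drawn toward workers that currently sit in flatter regions. The asynchronous communication between workers that the abstract mentions is deliberately left out.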