Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregation. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce the new notions of "upward" and "downward" divergences. We then use them in a novel analysis to obtain a worst-case convergence upper bound for two-level H-SGD with non-IID data, non-convex objective functions, and stochastic gradients. By extending this result to the case of random grouping, we observe that the convergence upper bound of H-SGD lies between the upper bounds of two single-level local SGD settings, whose numbers of local iterations equal the local and global update periods of H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study convergence in the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights into why local aggregation can be beneficial in improving the convergence of H-SGD.
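To make the two-level update schedule concrete, the following is a minimal sketch of H-SGD on a toy problem, not the algorithm or notation analyzed in this work. It assumes scalar models, equally weighted workers, a quadratic per-worker objective f_i(x) = 0.5 (x - c_i)^2 whose distinct optima c_i stand in for non-IID data, and a global period that is a multiple of the local period. The names `h_sgd`, `targets`, `groups`, `local_period`, and `global_period` are hypothetical and chosen only for illustration.

```python
import random


def h_sgd(targets, groups, local_period, global_period, steps,
          lr=0.1, noise=0.0, seed=0):
    """Toy two-level H-SGD (illustrative assumptions, see lead-in).

    targets: per-worker optimum c_i of f_i(x) = 0.5 * (x - c_i)**2
    groups:  list of lists of worker indices, one list per local server
    """
    rng = random.Random(seed)
    models = [0.0] * len(targets)  # every worker starts from the same model
    for t in range(1, steps + 1):
        # Each worker takes one stochastic gradient step on its own objective.
        for i, c in enumerate(targets):
            grad = (models[i] - c) + rng.gauss(0.0, noise)
            models[i] -= lr * grad
        if t % global_period == 0:
            # Global aggregation: average over all workers (equal weights assumed).
            avg = sum(models) / len(models)
            models = [avg] * len(models)
        elif t % local_period == 0:
            # Local aggregation: each local server averages only its own group.
            for group in groups:
                avg = sum(models[i] for i in group) / len(group)
                for i in group:
                    models[i] = avg
    return models


# Four workers in two groups; local aggregation every 2 steps, global every 10.
final = h_sgd(targets=[0.0, 1.0, 2.0, 3.0], groups=[[0, 1], [2, 3]],
              local_period=2, global_period=10, steps=200)
```

Setting `local_period` equal to `global_period` in this sketch removes the effect of local aggregation, which gives single-level local SGD with the global period as the number of local iterations. This is one of the two single-level settings that the "sandwich behavior" compares against.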