Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which completes our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. To this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.
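For readers unfamiliar with the two methods being compared, the following is a minimal illustrative sketch, not the algorithms or problem instances analyzed in the paper. It assumes a toy setting in which each machine m holds a scalar quadratic objective f_m(x) = 0.5 * a_m * (x - b_m)^2 with additive Gaussian gradient noise; the function names, the `machines` list of `(a, b)` pairs, and all parameter values are hypothetical. Both methods use the same communication budget (`rounds`) and the same number of stochastic gradients per machine per round (`local_steps`); they differ only in whether those gradients drive local updates or are averaged into one large batch.

```python
import random


def stochastic_grad(x, a, b, sigma, rng):
    """Noisy gradient of the toy objective 0.5 * a * (x - b)**2 (an assumption)."""
    return a * (x - b) + rng.gauss(0.0, sigma)


def local_sgd(x0, machines, rounds, local_steps, lr, sigma, rng):
    """Each machine takes `local_steps` SGD steps on its own objective,
    then the local iterates are averaged once per communication round."""
    x = x0
    for _ in range(rounds):
        local_iterates = []
        for a, b in machines:
            y = x  # every machine starts the round from the shared iterate
            for _ in range(local_steps):
                y -= lr * stochastic_grad(y, a, b, sigma, rng)
            local_iterates.append(y)
        x = sum(local_iterates) / len(local_iterates)
    return x


def minibatch_sgd(x0, machines, rounds, local_steps, lr, sigma, rng):
    """Each machine computes `local_steps` stochastic gradients at the SAME
    shared point; all of them are averaged into one step per round."""
    x = x0
    for _ in range(rounds):
        grads = [
            stochastic_grad(x, a, b, sigma, rng)
            for a, b in machines
            for _ in range(local_steps)
        ]
        x -= lr * sum(grads) / len(grads)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    # Two machines with identical curvature but different minimizers
    # (hypothetical values chosen only for illustration).
    machines = [(1.0, 1.0), (1.0, -1.0)]
    args = dict(rounds=10, local_steps=5, lr=0.1, sigma=0.1, rng=rng)
    print("local SGD:     ", local_sgd(4.0, machines, **args))
    print("mini-batch SGD:", minibatch_sgd(4.0, machines, **args))
```

In this sketch the two machines share the same curvature, so averaging the local iterates introduces no bias toward either machine's minimizer. Giving the machines different values of `a` makes the local SGD iterate settle away from the minimizer of the average objective, which is one simple way to see why the kind of heterogeneity assumed can matter for local updates.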