We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite their popularity and efficiency in training deep neural networks, traditional analyses of error feedback algorithms rely on the smoothness assumption, which fails to capture the properties of the objective functions in these problems. Instead, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity conditions and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve an $O(1/\sqrt{K})$ convergence rate for nonconvex problems under generalized smoothness. Our analyses apply to distributed settings without data heterogeneity conditions and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees for normalized error feedback algorithms in stochastic settings. Finally, we show that, due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.
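To make the algorithmic idea concrete, below is a minimal NumPy sketch of a normalized, EF21-style distributed error feedback loop with Top-K compression. It is an illustrative sketch only: the function names (`top_k`, `normalized_ef21`), the choice of compressor, and the small constant added to the norm are assumptions for readability, not the paper's exact pseudocode.

```python
import numpy as np

def top_k(v, k):
    """Top-K sparsifier: keep the k largest-magnitude entries (a common contractive compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def normalized_ef21(grads, x0, stepsize, num_iters, k):
    """Sketch of a normalized error feedback loop (EF21-style memory on each worker).

    grads: list of callables; grads[i](x) returns worker i's local gradient at x.
    Illustrative only; hyperparameters and compressor are placeholder assumptions.
    """
    n = len(grads)
    x = x0.copy()
    g = [grads[i](x) for i in range(n)]      # local gradient estimators maintained by workers
    g_bar = sum(g) / n                       # server-side aggregate of the estimators
    for _ in range(num_iters):
        # normalized step: the direction g_bar / ||g_bar|| keeps the step length fixed at `stepsize`
        x = x - stepsize * g_bar / (np.linalg.norm(g_bar) + 1e-12)
        for i in range(n):
            # each worker compresses only the innovation (difference to its current estimator)
            g[i] = g[i] + top_k(grads[i](x) - g[i], k)
        g_bar = sum(g) / n
    return x
```

Normalizing the aggregated estimator bounds the step length regardless of the gradient scale, which is the mechanism behind the larger allowable stepsizes and the problem-parameter-free tuning highlighted in the abstract.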