Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theoretical understanding of this mechanism, there is still a lot to explore. In this work we study a modern form of error feedback called EF21 (Richtarik et al., 2021), which offers the best currently known theoretical guarantees under the weakest assumptions, and also works well in practice. In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean, which is never larger, and can be substantially smaller, especially in heterogeneous data regimes. We take the reader on a journey of our discovery process. Starting with the idea of applying EF21 to an equivalent reformulation of the underlying problem, which (unfortunately) requires (often impractical) machine cloning, we continue to the discovery of a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method. While this development applies to the simplest form of EF21, our approach naturally extends to more elaborate variants involving stochastic gradients and partial participation. Further, our technique improves the best-known theory of EF21 in the rare features regime (Richtarik et al., 2023). Finally, we validate our theoretical findings with suitable experiments.
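To make the two means concrete, the following is a minimal sketch, not the implementation used in this work. It compares the arithmetic and quadratic means of a hypothetical, strongly heterogeneous set of smoothness constants, and then runs the basic EF21 update with a TopK compressor on a toy problem. The toy objective (per-worker quadratics), the constants `L`, the dimension, the step size, and all function names are assumptions chosen for illustration; in particular, the step size is hand-picked conservatively rather than taken from any theorem.

```python
import numpy as np

def top_k(v, k):
    """TopK compressor: keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def arithmetic_and_quadratic_mean(L):
    """Return (arithmetic mean, quadratic mean) of the constants in L."""
    L = np.asarray(L, dtype=float)
    return L.mean(), np.sqrt(np.mean(L ** 2))

def ef21(grads, x0, step, k, iters):
    """Basic EF21 with TopK: each worker i keeps a gradient estimate g_i and
    sends only the compressed difference between its fresh gradient and g_i."""
    n = len(grads)
    x = x0.copy()
    g_local = [top_k(grad(x), k) for grad in grads]  # g_i^0 = C(grad f_i(x^0))
    g = sum(g_local) / n
    for _ in range(iters):
        x = x - step * g                              # server step with the aggregate
        for i, grad in enumerate(grads):
            g_local[i] = g_local[i] + top_k(grad(x) - g_local[i], k)
        g = sum(g_local) / n
    return x

# Hypothetical heterogeneous setup: one worker is far "rougher" than the others.
L = [1.0, 1.0, 1.0, 100.0]
am, qm = arithmetic_and_quadratic_mean(L)
print(f"arithmetic mean = {am:.2f}, quadratic mean = {qm:.2f}")  # 25.75 vs about 50.01

# Toy objective (assumption): f_i(x) = 0.5 * L_i * ||x - b_i||^2.
d = 10
rng = np.random.default_rng(0)
b = [rng.standard_normal(d) for _ in L]
grads = [lambda x, Li=Li, bi=bi: Li * (x - bi) for Li, bi in zip(L, b)]

# Minimizer of the average of the f_i, available in closed form for this toy problem.
x_star = sum(Li * bi for Li, bi in zip(L, b)) / sum(L)

x_out = ef21(grads, np.zeros(d), step=1e-3, k=2, iters=2000)
print("distance to minimizer:", np.linalg.norm(x_out - x_star))
```

In this example the arithmetic mean is roughly half the quadratic mean, and the gap widens as the constants become more spread out; when all constants are equal, the two means coincide.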
