Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when left untreated, the errors caused by compression propagate and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work, we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al. [2021], known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum.
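To make the idea concrete, the following is a minimal single-process sketch of how Polyak-style momentum might be combined with an EF21-type update. It is an illustration under stated assumptions, not the paper's reference implementation: the Top-k compressor, the toy quadratic objectives, the initialization of the local states from a single stochastic gradient, and all names and parameter values (`top_k`, `ef21_sgdm`, `gamma`, `eta`, `k`) are choices made here for illustration only.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest
    (Top-k is assumed here as one example of a lossy compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_sgdm(stoch_grads, x0, gamma, eta, k, steps):
    """Sketch of EF21 with momentum, simulated on one machine.

    stoch_grads: list of callables, one per worker; each returns a
                 stochastic gradient of that worker's local objective.
    gamma: step size; eta: momentum parameter in (0, 1]; k: Top-k budget.
    """
    x = x0.copy()
    # Assumed initialization: momentum and EF states start at one stochastic gradient.
    v = [grad(x) for grad in stoch_grads]
    g_loc = [vi.copy() for vi in v]
    g = np.mean(g_loc, axis=0)
    for _ in range(steps):
        x = x - gamma * g                      # server step with the aggregated estimate
        for i, grad in enumerate(stoch_grads):
            # Momentum: exponential moving average of stochastic gradients.
            v[i] = (1 - eta) * v[i] + eta * grad(x)
            # EF21-style update: compress the gap between the momentum
            # estimate and the current local state; only c would be sent.
            c = top_k(v[i] - g_loc[i], k)
            g_loc[i] = g_loc[i] + c
        g = np.mean(g_loc, axis=0)             # server averages the local states
    return x

# Toy problem (hypothetical): worker i holds f_i(x) = 0.5 * ||x - b_i||^2
# and observes gradients corrupted by Gaussian noise, with batch size 1.
rng = np.random.default_rng(0)
n_workers, dim = 4, 10
b = 5.0 + rng.standard_normal((n_workers, dim))

def make_grad(b_i):
    return lambda x: x - b_i + 0.1 * rng.standard_normal(dim)

grads = [make_grad(b_i) for b_i in b]
x_star = b.mean(axis=0)                        # minimizer of the average objective
x_out = ef21_sgdm(grads, np.zeros(dim), gamma=0.02, eta=0.1, k=5, steps=3000)
print("distance to minimizer:", np.linalg.norm(x_out - x_star))
```

Setting `eta=1` in this sketch switches the momentum off, so each worker compresses the gap to its raw stochastic gradient instead; comparing the two settings on a noisy problem is one simple way to explore the batch-size issue the abstract describes.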