In the last few years, communication compression techniques have emerged as indispensable tools for alleviating the communication bottleneck in distributed learning. However, although biased compressors often show superior performance in practice compared to the much more studied and better understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings. We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left( \delta L \exp \left[-\frac{\mu K}{\delta L}\right] + \frac{(C + \delta D)}{K\mu}\right)$, where $\delta\ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
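To make the setting concrete, here is a minimal sketch of distributed gradient descent with a biased compressor and error feedback. It is an illustration under stated assumptions, not the method analyzed in this work: Top-$k$ sparsification is used as a commonly cited example of a biased compressor, and the names `top_k` and `ef_compressed_gd`, the step size `lr`, and the per-node gradient callables are all hypothetical. In dimension $d$, Top-$k$ satisfies $\|\mathcal{C}(x)-x\|^2 \le (1-k/d)\|x\|^2$, so a smaller $k$ means more compression; reading this as $\delta = d/k$ is an assumption about how $\delta$ is defined.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsifier (assumed example of a biased compressor):
    keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

def ef_compressed_gd(grads, x0, lr, k, steps):
    """Sketch of distributed GD with Top-k compression and error feedback.

    grads : list of callables (hypothetical interface), one per node,
            each returning that node's local gradient at x.
    Each node keeps a residual of everything it has not yet transmitted
    and adds it back before compressing the next update.
    """
    x = x0.copy()
    errors = [np.zeros_like(x0) for _ in grads]
    for _ in range(steps):
        messages = []
        for i, grad in enumerate(grads):
            p = errors[i] + lr * grad(x)   # residual plus new scaled gradient
            c = top_k(p, k)                # only c is communicated
            errors[i] = p - c              # remember what was dropped
            messages.append(c)
        x = x - np.mean(messages, axis=0)  # server averages compressed updates
    return x
```

For instance, with every node holding the same quadratic $f(x)=\tfrac12\|x-b\|^2$ (so the local gradients agree at the optimum), calling `ef_compressed_gd([lambda x: x - b] * 3, np.zeros(10), lr=0.005, k=2, steps=5000)` should drive the iterate close to `b` even though each node sends only 2 of 10 coordinates per round. The step size here is a conservative guess, not a value taken from the analysis.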