EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization

In distributed or federated optimization and learning, communication between the computing units is often the bottleneck, and gradient compression is widely used to reduce the number of bits sent in each communication round of iterative methods. There are two classes of compression operators, each with its own algorithms. For unbiased random compressors with bounded variance (e.g., rand-k), the state of the art is the DIANA algorithm of Mishchenko et al. (2019), which uses a variance reduction technique to handle the variance introduced by compression. For biased and contractive compressors (e.g., top-k), the state of the art is the EF21 algorithm of Richtárik et al. (2021), which instead uses an error-feedback mechanism. These two classes of compression schemes and algorithms are distinct, with different analyses and proof techniques. In this paper, we unify them into a single framework and propose a new algorithm that recovers DIANA and EF21 as particular cases. Our general approach works with a new, larger class of compressors, which has two parameters, the bias and the variance, and includes unbiased and biased compressors as particular cases. This allows us to inherit the best of both worlds: like EF21 and unlike DIANA, our algorithm can use biased compressors such as top-k, whose good practical performance is well recognized; and like DIANA and unlike EF21, independent randomness at the compressors mitigates the effects of compression, so the convergence rate improves when the number of parallel workers is large. This is the first time that an algorithm with all these features has been proposed. We prove its linear convergence under certain conditions. Our approach takes a step towards a better understanding of two so-far distinct worlds of communication-efficient distributed learning.
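
To make the two compressor classes concrete, the sketch below gives minimal NumPy versions of rand-k and top-k using their commonly stated definitions. It is an illustration only, not code from the paper: the function names `rand_k` and `top_k`, their signatures, and the d/k rescaling convention for rand-k are assumptions made here.

```python
import numpy as np


def rand_k(x, k, rng):
    """Rand-k sketch (assumed convention): keep k coordinates chosen
    uniformly at random and rescale them by d/k so that the compressed
    vector equals x in expectation (unbiased, bounded variance)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    idx = rng.choice(d, size=k, replace=False)  # k distinct coordinates
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out


def top_k(x, k):
    """Top-k sketch: keep the k largest-magnitude coordinates and zero
    the rest (deterministic, biased, contractive)."""
    x = np.asarray(x, dtype=float)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = np.array([1.0, -5.0, 2.0, 4.0])
    print("top-k :", top_k(g, 2))
    print("rand-k:", rand_k(g, 2, rng))
```

In this sketch, top-k always discards the same small coordinates, which is the source of its bias, whereas rand-k is correct on average but noisy on any single draw. That noise is what averaging over many workers with independent randomness can reduce.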
