Communication compression is a common technique in distributed optimization that alleviates communication overhead by transmitting compressed gradients and model parameters. However, compression introduces information distortion, which slows convergence and requires more communication rounds to reach a desired solution. Given this trade-off between lower per-round communication cost and additional rounds of communication, it is unclear whether communication compression reduces the total communication cost. This paper explores the conditions under which unbiased compression, a widely used form of compression, can reduce the total communication cost, and the extent to which it can do so. To this end, we present the first theoretical formulation for characterizing the total communication cost in distributed optimization with communication compression. We demonstrate that unbiased compression alone does not necessarily reduce the total communication cost, but that a reduction can be achieved if the compressors used by all workers are further assumed to be independent. We establish lower bounds on the number of communication rounds required by algorithms using independent unbiased compressors to minimize smooth convex functions, and show that these lower bounds are tight by refining the analysis of ADIANA. Our results reveal that independent unbiased compression can reduce the total communication cost by a factor of up to $\Theta(\sqrt{\min\{n, \kappa\}})$, where $n$ is the number of workers and $\kappa$ is the condition number of the functions being minimized. These theoretical findings are supported by experimental results.
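To give some intuition for why independence across workers matters, the following is a minimal sketch, not the paper's algorithm or experiments. It assumes rand-k sparsification as the unbiased compressor (an illustrative choice that the abstract does not name), and the function names `rand_k` and `mean_sq_error` are hypothetical. It estimates the squared error of the averaged compressed vector when $n$ workers either share one random draw or draw independently.

```python
import random

def rand_k(x, k, rng):
    """Rand-k sparsification (illustrative unbiased compressor).

    Keeps k coordinates chosen uniformly at random and rescales them
    by d/k, so that E[C(x)] = x.
    """
    d = len(x)
    kept = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * v if i in kept else 0.0 for i, v in enumerate(x)]

def mean_sq_error(x, n, k, independent, trials=2000, seed=0):
    """Monte Carlo estimate of E||(1/n) sum_i C_i(x) - x||^2 over n workers."""
    master = random.Random(seed)
    d = len(x)
    total = 0.0
    for _ in range(trials):
        if independent:
            # Each worker gets its own random stream.
            rngs = [random.Random(master.getrandbits(64)) for _ in range(n)]
        else:
            # All workers reuse the same seed, so they keep the same coordinates.
            shared_seed = master.getrandbits(64)
            rngs = [random.Random(shared_seed) for _ in range(n)]
        avg = [0.0] * d
        for rng in rngs:
            c = rand_k(x, k, rng)
            for j in range(d):
                avg[j] += c[j] / n
        total += sum((a - b) ** 2 for a, b in zip(avg, x))
    return total / trials

x = [1.0] * 20
err_shared = mean_sq_error(x, n=8, k=5, independent=False)
err_indep = mean_sq_error(x, n=8, k=5, independent=True)
print(err_shared, err_indep)
```

Under these assumptions, a single rand-k compressor has expected squared error $(d/k - 1)\|x\|^2$. Averaging $n$ independent copies divides that error by $n$, while averaging $n$ identical copies leaves it unchanged. This is only the variance-reduction effect behind the independence assumption; it does not reproduce the lower bounds or the ADIANA analysis.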