In this paper, we investigate the impact of compression, a technique widely used in distributed and federated learning, on stochastic gradient algorithms for machine learning. We underline differences in convergence rates between several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm, relying on a random field, for minimizing quadratic functions. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected H\"older regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate that, despite the non-regularity of the stochastic field, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), generalizing the rate for the vanilla LSR case, where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.
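As a small numerical illustration of the scaling above, the following sketch evaluates $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ and checks that it reduces to $\sigma^2 d / K$ when the additive noise covariance is taken to be $\sigma^2 H$, as the vanilla LSR rate suggests. The function name `limit_variance`, the randomly generated Hessian, and the values of `d`, `K`, and `sigma` are assumptions made for illustration only; they are not part of the analysis.

```python
import numpy as np


def limit_variance(C_ania, H, K):
    """Return Tr(C_ania H^{-1}) / K, the scaling of the limit variance term."""
    # Tr(C H^{-1}) = Tr(H^{-1} C) by cyclicity of the trace, so a linear
    # solve gives the same value without forming H^{-1} explicitly.
    return float(np.trace(np.linalg.solve(H, C_ania)) / K)


rng = np.random.default_rng(0)
d, K, sigma = 5, 1000, 0.5  # illustrative values (assumptions)

# Hypothetical Hessian: a random symmetric positive definite matrix.
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)

# Vanilla LSR case, assuming the additive noise covariance is sigma^2 * H.
vanilla = limit_variance(sigma**2 * H, H, K)
print(vanilla, sigma**2 * d / K)
```

Replacing `sigma**2 * H` with the covariance induced by a given compression operator would show how the same trace changes with the compression strategy, even when the operators share the same variance bound.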