We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{s}$ and $\tau_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta = f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication time $\tau_{s}$ from the server to the workers, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{s} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new "worst-case" function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
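To make the notion of an unbiased random sparsification compressor concrete, the following is a minimal sketch of one common instance, often called Rand-K: it keeps $K$ of the $d$ coordinates chosen uniformly at random and rescales them by $d/K$ so that the output equals the input in expectation. The function name `rand_k`, the test vector, and the sample count are illustrative assumptions, not taken from the text above, which does not name a specific compressor.

```python
import numpy as np


def rand_k(x, k, rng):
    """Illustrative Rand-K sparsifier (hypothetical helper).

    Keeps k coordinates of x chosen uniformly at random without
    replacement and scales them by d / k, so that E[rand_k(x)] = x.
    """
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out


# Empirical check of unbiasedness on a small vector (d = 8, k = 2).
rng = np.random.default_rng(0)
x = np.arange(1.0, 9.0)
samples = np.stack([rand_k(x, 2, rng) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)
```

Only $K$ coordinates are sent per message, so the per-message communication cost falls from $d$ to $K$ coordinates, at the price of a compression variance of $(d/K - 1)\|x\|^2$ for this particular compressor.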