We provide the first proof that gradient descent $\left({\color{green}\sf GD}\right)$ with greedy sparsification $\left({\color{green}\sf TopK}\right)$ and error feedback $\left({\color{green}\sf EF}\right)$ can obtain better communication complexity than vanilla ${\color{green}\sf GD}$ when solving the distributed optimization problem $\min_{x\in \mathbb{R}^d} {f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)}$, where $n$ is the number of clients, $d$ is the number of features, and $f_1,\dots,f_n$ are smooth nonconvex functions. Despite intensive research since 2014, when ${\color{green}\sf EF}$ was first proposed by Seide et al., this problem remained open until now. We show that ${\color{green}\sf EF}$ shines in the regime where features are rare, i.e., where each feature is present in the data owned by only a small number of clients. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\lVert {\nabla f(\hat{x})} \rVert^2 \leq \varepsilon$ in expectation, ${\color{green}\sf GD}$ with the ${\color{green}\sf Top1}$ sparsifier and ${\color{green}\sf EF}$ requires ${\cal O} \left(\left( L+{\color{blue}r} \sqrt{ \frac{{\color{red}c}}{n} \min \left( \frac{{\color{red}c}}{n} \max_i L_i^2, \frac{1}{n}\sum_{i=1}^n L_i^2 \right) }\right) \frac{1}{\varepsilon} \right)$ bits to be communicated by each worker to the server only, where $L$ is the smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, ${\color{red}c}$ is the maximal number of clients owning any feature ($1\leq {\color{red}c} \leq n$), and ${\color{blue}r}$ is the maximal number of features owned by any client ($1\leq {\color{blue}r} \leq d$). Clearly, the communication complexity improves as ${\color{red}c}$ decreases (i.e., as features become rarer), and can be much better than the ${\cal O}({\color{blue}r} L \frac{1}{\varepsilon})$ communication complexity of ${\color{green}\sf GD}$ in the same regime.
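To get a feel for how the two bounds compare, the following minimal sketch evaluates the expressions inside the ${\cal O}(\cdot)$ terms, with the common factor $\frac{1}{\varepsilon}$ and all hidden constants dropped. The function names `ef_top1_factor` and `gd_factor` and all numerical values are hypothetical choices made for illustration; they do not come from the paper, and the outputs should be read as relative scalings only, not as actual bit counts.

```python
import math


def ef_top1_factor(L, L_i, c, r):
    """Evaluate L + r * sqrt((c/n) * min((c/n) * max_i L_i^2, (1/n) * sum_i L_i^2)).

    L   : smoothness constant of f
    L_i : list of smoothness constants of f_1, ..., f_n
    c   : maximal number of clients owning any feature (1 <= c <= n)
    r   : maximal number of features owned by any client (1 <= r <= d)
    """
    n = len(L_i)
    max_sq = max(l * l for l in L_i)        # max_i L_i^2
    mean_sq = sum(l * l for l in L_i) / n   # (1/n) * sum_i L_i^2
    return L + r * math.sqrt((c / n) * min((c / n) * max_sq, mean_sq))


def gd_factor(L, r):
    """Evaluate r * L, the expression inside the GD bound."""
    return r * L


if __name__ == "__main__":
    # Hypothetical setting: n = 100 clients, all L_i = 1, L = 1, r = 50.
    n, r, L = 100, 50, 1.0
    L_i = [1.0] * n
    for c in (1, 10, 100):
        print(f"c = {c:3d}: EF + Top1 factor = {ef_top1_factor(L, L_i, c, r):.3f}, "
              f"GD factor = {gd_factor(L, r):.3f}")
```

In this made-up setting, the ${\color{green}\sf EF}$ expression is about $1.5$ at ${\color{red}c}=1$ against $50$ for ${\color{green}\sf GD}$, and it grows to about $51$ at ${\color{red}c}=n$, where the two expressions are of the same order.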