In this work, we establish a non-ergodic linear convergence estimate for gradient descent with delay $\tau\in\mathbb{N}$ when the cost function is $\mu$-strongly convex and $L$-smooth. This result improves upon the well-known estimates of Arjevani et al. \cite{ASS} and Stich-Karimireddy \cite{SK} in that it is non-ergodic and holds under weaker assumptions on the cost function. Moreover, the admissible range of the learning rate $\eta$ is extended from $\eta\leq 1/(10L\tau)$ to $\eta\leq 1/(4L\tau)$ for $\tau =1$ and to $\eta\leq 3/(10L\tau)$ for $\tau \geq 2$, where $L >0$ is the Lipschitz constant of the gradient of the cost function. We further show linear convergence of the cost function under the Polyak-{\L}ojasiewicz\,(PL) condition, in which case the admissible learning rate improves further to $\eta\leq 9/(10L\tau)$ for large delays $\tau$. The proof framework also extends to stochastic gradient descent with time-varying delay under the PL condition. Finally, numerical experiments are provided to confirm the theoretical results.
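For concreteness, the delayed gradient descent iteration considered here may be written in the standard form (a minimal sketch under common conventions; the precise scheme and constants in the body of the paper may differ):
\begin{equation*}
  x_{k+1} = x_k - \eta\,\nabla f(x_{k-\tau}), \qquad k \geq \tau,
\end{equation*}
where the gradient is evaluated at the iterate from $\tau$ steps earlier, and the PL condition with constant $\mu>0$ reads
\begin{equation*}
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \geq \mu\bigl(f(x) - f^{*}\bigr), \qquad f^{*} = \inf_{x} f(x).
\end{equation*}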