Although adaptive gradient methods are widely used in deep learning, their convergence rates have not been thoroughly studied, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes the convergence rate $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O\left(\frac{\sqrt{d}}{T^{1/4}}\right)$, measured in the $\ell_1$ norm, without the bounded gradient assumption, where $d$ is the dimension of the optimization variable and $T$ is the number of iterations. Since $\|x\|_2\ll\|x\|_1\leq\sqrt{d}\|x\|_2$ for problems with extremely large $d$, this rate can be regarded as the $\ell_1$-norm analogue of the rate $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq O\left(\frac{1}{T^{1/4}}\right)$ of SGD.
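The norm comparison behind this remark can be sketched as follows. This is an illustrative derivation under the standard norm inequalities on $\mathbb{R}^d$, not a reproduction of the paper's proof. For any $x\in\mathbb{R}^d$, the Cauchy-Schwarz inequality gives

$$\|x\|_1=\sum_{i=1}^d |x_i|\cdot 1\leq\Big(\sum_{i=1}^d x_i^2\Big)^{1/2}\Big(\sum_{i=1}^d 1\Big)^{1/2}=\sqrt{d}\,\|x\|_2,$$

with equality when all coordinates have the same magnitude. Applying this to $x=\nabla f(x^k)$, a bound of order $T^{-1/4}$ in the $\ell_2$ norm translates into

$$\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq\sqrt{d}\cdot\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq O\left(\frac{\sqrt{d}}{T^{1/4}}\right),$$

which has the same form as the $\ell_1$ rate stated for RMSProp. The factor $\sqrt{d}$ therefore reflects the change of norm rather than a slower dependence on $T$.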