We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following: 1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can be arbitrarily slow. 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t\to\infty$. 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective. 4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: we prove that there are convex, monotone decreasing, integrable functions $g(t)$ which decrease to zero more slowly than $f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$. This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(t\phi(t)))$ for any function $\phi$ which satisfies $\lim_{t\to\infty}\phi(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise, and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to \inf f$ almost surely. This improves on the almost sure convergence along a subsequence, which is all that follows from the $O(1/n)$ decay estimate.
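As a rough numerical illustration of item 2 in the discrete-time setting, the sketch below runs plain gradient descent on the one-dimensional convex function $f(x) = |x|^p/p$, which has minimizer $0$ and $\inf f = 0$. The test function, the step size `step = 0.1`, the starting point `x0 = 1.0`, and the helper name `excess_energy` are all assumptions chosen for illustration and do not come from the paper. For this family the gradient flow can be solved by hand, giving a decay of order $t^{-p/(p-2)}$, which approaches but never reaches $1/t$ as $p$ grows; the script only checks that the behavior of gradient descent is consistent with a summable, $o(1/t)$ excess energy.

```python
def excess_energy(p, x0=1.0, step=0.1, n_steps=10_000):
    """Gradient descent on f(x) = |x|**p / p (illustrative choice, not from the paper).

    Returns the list [f(x_n) - inf f for n = 0..n_steps]; here inf f = 0.
    """
    x = x0
    values = []
    for _ in range(n_steps + 1):
        values.append(abs(x) ** p / p)          # excess energy at the current iterate
        sign = 1.0 if x >= 0 else -1.0
        grad = sign * abs(x) ** (p - 1)         # f'(x) for f(x) = |x|**p / p
        x -= step * grad                        # explicit gradient descent step
    return values


def scaled_excess(values, n, step=0.1):
    """Return t * (f(x_n) - inf f) with t = n * step, the quantity that should tend to 0."""
    return n * step * values[n]


if __name__ == "__main__":
    for p in (4, 10):
        vals = excess_energy(p)
        # Riemann-sum analogue of the time integral of the excess energy.
        partial_sum = 0.1 * sum(vals[1:])
        print(
            f"p={p}: t*excess at n=100: {scaled_excess(vals, 100):.3e}, "
            f"at n=10000: {scaled_excess(vals, 10_000):.3e}, "
            f"step * sum of excess: {partial_sum:.3e}"
        )
```

With these assumed parameters the step size stays below $1/L$ on $[0, 1]$ for both exponents, so the standard estimate $h\sum_{n\ge 1} \big(f(x_n) - \inf f\big) \le |x_0 - x^*|^2/2$ applies, and the printed partial sums should stay below $0.5$. A finite run like this can only be consistent with the asymptotic statements; it does not verify them.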