We show that for separable convex optimization, random stepsizes fully accelerate Gradient Descent. Specifically, using inverse stepsizes i.i.d. from the Arcsine distribution improves the iteration complexity from $O(k)$ to $O(k^{1/2})$, where $k$ is the condition number. No momentum or other algorithmic modifications are required. This result is incomparable to the (deterministic) Silver Stepsize Schedule, which does not require separability but achieves only partial acceleration $O(k^{\log_{1+\sqrt{2}} 2}) \approx O(k^{0.78})$. Our starting point is a conceptual connection to potential theory: the variational characterization of the stepsize distribution with the fastest convergence rate mirrors the variational characterization of the distribution of charged particles with minimal logarithmic potential energy. The Arcsine distribution solves both variational problems due to a remarkable "equalization property," which in the physical context amounts to a constant potential over space, and in the optimization context amounts to an identical convergence rate over all quadratic functions. A key technical insight is that martingale arguments extend this phenomenon to all separable convex functions. We interpret this equalization as an extreme form of hedging: by using this random distribution over stepsizes, Gradient Descent converges at exactly the same rate for all functions in the function class.
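The stepsize rule described in the abstract can be sketched in a few lines. The sketch below is illustrative only, under stated assumptions: it runs Gradient Descent on a separable quadratic $f(x) = \tfrac{1}{2}\sum_i \lambda_i x_i^2$ with eigenvalues in $[m, L]$ (condition number $k = L/m$), drawing each inverse stepsize i.i.d. from the Arcsine distribution on $[m, L]$. The sampler uses the standard fact that $\sin^2(\pi U/2)$ with $U \sim \mathrm{Unif}(0,1)$ is Arcsine-distributed on $[0,1]$. All names, the test function, and the interval endpoints are assumptions for illustration, not the paper's code.

```python
import math
import random

def arcsine_sample(m, L, rng):
    # If U ~ Uniform(0,1), then sin^2(pi*U/2) follows the Arcsine
    # distribution on [0,1]; shift and scale it to [m, L].
    u = rng.random()
    return m + (L - m) * math.sin(math.pi * u / 2.0) ** 2

def gd_random_stepsizes(lam, x0, m, L, iters, seed=0):
    # Gradient Descent on the separable quadratic
    # f(x) = (1/2) * sum_i lam[i] * x[i]^2, whose gradient is lam[i]*x[i].
    # Each stepsize is the reciprocal of an i.i.d. Arcsine draw on [m, L].
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        h = 1.0 / arcsine_sample(m, L, rng)  # inverse stepsize ~ Arcsine(m, L)
        x = [xi - h * li * xi for xi, li in zip(x, lam)]
    return x

if __name__ == "__main__":
    m, L = 1.0, 100.0            # eigenvalue range; condition number k = 100
    lam = [1.0, 10.0, 100.0]     # eigenvalues of the separable quadratic
    x = gd_random_stepsizes(lam, [1.0, 1.0, 1.0], m, L, iters=50)
    residual = 0.5 * sum(li * xi * xi for li, xi in zip(lam, x))
    print(residual)
```

Note that individual steps may overshoot (a draw near $m$ gives a large stepsize), but the equalization property concerns the product of contraction factors over many iterations, so convergence is a statement about the whole random sequence rather than any single step.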