Gradient is All You Need?

In this paper we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO), a recently proposed multi-particle derivative-free optimization method, as a stochastic relaxation of gradient descent. Remarkably, we observe that, through communication among the particles, CBO exhibits stochastic gradient descent (SGD)-like behavior despite relying solely on evaluations of the objective function. The fundamental value of such a link between CBO and SGD lies in the fact that CBO provably converges to global minimizers for broad classes of nonsmooth and nonconvex objective functions. On the one hand, this offers a novel explanation for the success of stochastic relaxations of gradient descent. On the other hand, contrary to the conventional wisdom according to which zero-order methods ought to be inefficient or to lack generalization abilities, our results unveil an intrinsic gradient descent nature of such heuristics. This viewpoint furthermore complements previous insights into the working principles of CBO, which describe the dynamics in the mean-field limit through a nonlinear nonlocal partial differential equation that makes it possible to alleviate the complexities of the nonconvex function landscape. Our proofs leverage a completely nonsmooth analysis, which combines a novel quantitative version of the Laplace principle (log-sum-exp trick) and the minimizing movement scheme (proximal iteration). In doing so, we furnish useful and precise insights that explain how stochastic perturbations of gradient descent overcome energy barriers and reach deep levels of nonconvex functions. Instructive numerical illustrations support the provided theoretical insights.
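To make the derivative-free nature of CBO concrete, the following is a minimal sketch of the standard isotropic CBO iteration from the literature, discretized with an Euler-Maruyama step. It is not taken from this paper: the function names `consensus_point` and `cbo_minimize`, the parameter values (`alpha`, `lam`, `sigma`, `dt`, `steps`), and the choice of isotropic noise are all illustrative assumptions. The consensus point is a Gibbs-weighted average of the particles, which by the Laplace principle concentrates near the best particle as `alpha` grows. Only evaluations of `f` are used, never its gradient.

```python
import numpy as np


def consensus_point(x, fx, alpha):
    """Gibbs-weighted average of particles x (shape (N, d)) with values fx (shape (N,)).

    Subtracting fx.min() before exponentiating leaves the weighted average
    unchanged and avoids overflow (the usual log-sum-exp stabilization).
    """
    w = np.exp(-alpha * (fx - fx.min()))
    return (w[:, None] * x).sum(axis=0) / w.sum()


def cbo_minimize(f, x0, alpha=50.0, lam=1.0, sigma=0.7, dt=0.01, steps=2000, seed=0):
    """Illustrative isotropic CBO; all default parameters are assumptions.

    f  : objective mapping a vector of shape (d,) to a scalar
    x0 : initial particle positions, shape (N, d)
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        fx = np.apply_along_axis(f, 1, x)      # objective evaluations only
        c = consensus_point(x, fx, alpha)      # particles "communicate" via c
        diff = x - c
        dist = np.linalg.norm(diff, axis=1, keepdims=True)
        noise = rng.standard_normal(x.shape)
        # Drift toward the consensus point plus noise scaled by the distance to it.
        x = x - lam * dt * diff + sigma * np.sqrt(dt) * dist * noise
    fx = np.apply_along_axis(f, 1, x)
    return consensus_point(x, fx, alpha)
```

A call might look like `cbo_minimize(lambda v: np.sum((v - 1.0) ** 2), x0)` with `x0` drawn uniformly from a box around the region of interest. The noise term vanishes as particles reach consensus, so the ensemble explores early and contracts late.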
