Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov's momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, in which only a subset of the parameters satisfies strong convexity, and show that Nesterov's momentum provably achieves acceleration for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, which, to the best of our knowledge, makes this work the first to prove an accelerated convergence rate for non-trivial neural network architectures.
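As a point of reference for what "acceleration" means here, the following minimal sketch compares plain gradient descent with Nesterov's momentum on a toy strongly convex quadratic. It is an illustration under stated assumptions, not the objective class or the analysis of this work: the diagonal quadratic, the step size 1/L, the constant momentum (sqrt(kappa) - 1)/(sqrt(kappa) + 1), and all function names are hypothetical choices taken from the classical strongly convex setting.

```python
import math

# Toy objective (an assumption for illustration only):
# f(x) = 0.5 * sum_i eigs[i] * x[i]**2, minimized at x = 0.
def loss(eigs, x):
    return 0.5 * sum(l * xi * xi for l, xi in zip(eigs, x))

def grad(eigs, x):
    return [l * xi for l, xi in zip(eigs, x)]

def gradient_descent(eigs, x0, steps):
    """Plain gradient descent with step size 1/L."""
    L = max(eigs)
    x = list(x0)
    for _ in range(steps):
        g = grad(eigs, x)
        x = [xi - gi / L for xi, gi in zip(x, g)]
    return x

def nesterov(eigs, x0, steps):
    """Nesterov's momentum with the classical constant momentum
    for a mu-strongly convex, L-smooth objective."""
    mu, L = min(eigs), max(eigs)
    kappa = L / mu  # condition number
    beta = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)
    x, y = list(x0), list(x0)
    for _ in range(steps):
        g = grad(eigs, y)  # gradient at the look-ahead point
        x_next = [yi - gi / L for yi, gi in zip(y, g)]
        y = [xn + beta * (xn - xi) for xn, xi in zip(x_next, x)]
        x = x_next
    return x

# Curvatures spread evenly between mu = 1 and L = 100 (kappa = 100).
eigs = [1.0 + 99.0 * i / 49 for i in range(50)]
x0 = [1.0] * 50
steps = 200

f0 = loss(eigs, x0)
f_gd = loss(eigs, gradient_descent(eigs, x0, steps))
f_nag = loss(eigs, nesterov(eigs, x0, steps))
print(f"initial loss:      {f0:.3e}")
print(f"gradient descent:  {f_gd:.3e}")
print(f"Nesterov momentum: {f_nag:.3e}")
```

In this toy setting, gradient descent contracts the error by roughly a factor of 1 - 1/kappa per step, whereas Nesterov's momentum contracts it by roughly 1 - 1/sqrt(kappa), so after the same number of steps the momentum iterate reaches a much smaller loss. Whether a comparable gap holds when only a subset of the parameters satisfies strong convexity is the question the abstract addresses.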