Gradient descent (GD) on deep neural network loss landscapes is a non-convex optimization problem, yet in practice it often converges far faster than classical guarantees suggest. Prior work shows that within locally quasi-convex regions (LQCRs), GD converges to stationary points at sublinear rates, leaving the commonly observed near-exponential training dynamics unexplained. We show that, under a mild local Neural Tangent Kernel (NTK) stability assumption, the loss satisfies a PL-type error bound within these regions, yielding a Locally Polyak-Łojasiewicz Region (LPLR) in which the squared gradient norm controls the suboptimality gap. For properly initialized finite-width networks, we show that this PL-type mechanism holds around initialization under local NTK stability, and we establish linear convergence of GD as long as the iterates remain within the resulting LPLR. Empirically, we observe PL-like scaling and linear-rate loss decay both in controlled full-batch training and in a ResNet-style CNN trained with mini-batch SGD on a CIFAR-10 subset, indicating that LPLR signatures can persist under modern architectures and stochastic optimization. Overall, these results connect local geometric structure, local NTK stability, and fast optimization rates in a finite-width setting.
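For concreteness, the Polyak-Łojasiewicz mechanism invoked above can be summarized in its textbook form (a generic statement under β-smoothness, not the paper's exact local conditions): if the loss satisfies a PL inequality with constant μ on a region, then a gradient step with step size 1/β contracts the suboptimality gap geometrically while the iterate stays in that region,
\[
\tfrac{1}{2}\,\bigl\|\nabla L(\theta_t)\bigr\|^2 \;\ge\; \mu\,\bigl(L(\theta_t) - L^{*}\bigr)
\quad\Longrightarrow\quad
L(\theta_{t+1}) - L^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)\bigl(L(\theta_t) - L^{*}\bigr),
\]
where the implication follows from the descent lemma \(L(\theta_{t+1}) \le L(\theta_t) - \tfrac{1}{2\beta}\|\nabla L(\theta_t)\|^2\) for the step \(\theta_{t+1} = \theta_t - \tfrac{1}{\beta}\nabla L(\theta_t)\). This is the sense in which the squared gradient norm controlling the suboptimality gap yields a linear rate.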
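The "PL-like scaling" mentioned above can be checked with a simple diagnostic: track the ratio of the squared gradient norm to the suboptimality gap along the optimization path and verify that it stays bounded away from zero. Below is a minimal sketch of such a check, assuming a synthetic regression task, a small MLP, plain full-batch GD, and L* = 0 as a proxy optimum; none of these choices are taken from the paper.

```python
# Minimal sketch (not the paper's experimental code): track the empirical
# PL ratio  rho_t = ||grad L(theta_t)||^2 / (L(theta_t) - L*)  during
# full-batch gradient descent. A roughly constant lower bound on rho_t
# over training is the "PL-like scaling" signature. The model, data,
# and L* = 0 (interpolation) are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)                      # synthetic inputs (assumption)
y = torch.randn(256, 1)                       # synthetic targets (assumption)
model = nn.Sequential(nn.Linear(10, 512), nn.ReLU(), nn.Linear(512, 1))
loss_fn = nn.MSELoss()
lr, loss_star = 1e-2, 0.0                     # take L* = 0 as a proxy optimum

for t in range(500):
    loss = loss_fn(model(X), y)               # full-batch loss L(theta_t)
    model.zero_grad()
    loss.backward()
    grad_sq = sum((p.grad ** 2).sum() for p in model.parameters())
    rho = grad_sq.item() / max(loss.item() - loss_star, 1e-12)
    if t % 50 == 0:
        print(f"t={t:4d}  loss={loss.item():.4e}  PL ratio={rho:.4e}")
    with torch.no_grad():                     # plain GD step: theta -= lr * grad
        for p in model.parameters():
            p -= lr * p.grad
```

Under a PL-type bound, the printed ratio should remain bounded below by a positive constant while the loss decays at a roughly linear (geometric) rate on a log scale.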