We study gradient flows for loss landscapes of fully connected feedforward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus, or GELU functions. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently large architectures and data sets, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof relies heavily on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to more realistic scenarios, where we observe analogous behavior.
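As a hedged sketch of the setting (the symbols $\Theta$, $\mathcal{L}$, $\Theta_0$, and $\ell^{\ast}$ are illustrative notation, not fixed by the abstract), the gradient flow studied here is a solution of
\[
  \Theta(0)=\Theta_0, \qquad \dot{\Theta}(t) = -\nabla \mathcal{L}\bigl(\Theta(t)\bigr), \qquad t \ge 0,
\]
where $\mathcal{L}$ denotes the loss over the network parameters $\Theta$. The dichotomy above then reads: either $\Theta(t)$ converges to a critical point of $\mathcal{L}$, or $\|\Theta(t)\| \to \infty$ while $\mathcal{L}(\Theta(t))$ converges to an asymptotic critical value $\ell^{\ast}$.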