We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in deep learning networks, and introduce two modified versions: one adapted to the overparametrized setting, and the other to the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized setting, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry.
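
To illustrate how a uniform exponential rate yields an a priori stopping time, here is a short sketch in assumed notation that does not appear in the text above: $\mathcal{C}(t)$ denotes the $\mathcal{L}^2$ cost along an orbit, $\mathcal{C}_*$ its global minimum, $\lambda > 0$ a rate independent of the initial data, and $\varepsilon > 0$ the prescribed proximity. The value of $\lambda$ is left unspecified here. Suppose the convergence statement takes the form

$$
\mathcal{C}(t) - \mathcal{C}_* \;\le\; e^{-\lambda t}\,\bigl(\mathcal{C}(0) - \mathcal{C}_*\bigr) \qquad \text{for all } t \ge 0 .
$$

Then, whenever $\mathcal{C}(0) - \mathcal{C}_* > \varepsilon$, requiring the right-hand side to be at most $\varepsilon$ gives

$$
\mathcal{C}(t) - \mathcal{C}_* \;\le\; \varepsilon \qquad \text{for all } t \;\ge\; t_\varepsilon := \frac{1}{\lambda}\,\ln\frac{\mathcal{C}(0) - \mathcal{C}_*}{\varepsilon} .
$$

Under these assumptions, $t_\varepsilon$ depends only on the initial cost gap, the rate $\lambda$, and the tolerance $\varepsilon$, so it can be computed before the flow is run.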