We consider the gradient descent flow widely used to minimize the $\mathcal{L}^2$ cost function in deep learning networks, and introduce two modified versions: one adapted to the overparametrized setting, and the other to the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized setting, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate. We point out relations of the latter to sub-Riemannian geometry.
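
To make the overparametrized statement concrete, the following is a minimal numerical sketch and not the construction of the paper. The abstract does not spell out the modified flow, so we assume one natural reading: with $D = \partial x / \partial \theta$ the Jacobian of the network output $x(\theta)$ and $y$ the target, the parameters evolve by $\dot{\theta} = -D^T (D D^T)^{-1} (x - y)$, which is well defined when $D$ has full row rank. The toy model, the matrix `W`, and the names `model`, `jacobian`, and `adapted_descent` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy model (not from the paper):
#   x(theta) = W @ (theta + 0.5 * sin(theta)),
# with K = 10 parameters and Q = 3 outputs, so K > Q (overparametrized).
def model(W, theta):
    return W @ (theta + 0.5 * np.sin(theta))

def jacobian(W, theta):
    # D = dx/dtheta = W @ diag(1 + 0.5 * cos(theta)).
    # The diagonal entries lie in [0.5, 1.5], so D has full row rank
    # whenever W does; this plays the role of the rank condition.
    return W * (1.0 + 0.5 * np.cos(theta))

def adapted_descent(W, y, theta0, step=0.02, n_steps=1000):
    """Euler steps for the ASSUMED flow theta' = -D^T (D D^T)^{-1} (x - y)."""
    theta = theta0.copy()
    costs = []
    for _ in range(n_steps):
        r = model(W, theta) - y              # residual x(theta) - y
        costs.append(0.5 * float(r @ r))     # L^2 cost at the current step
        D = jacobian(W, theta)
        # Solve (D D^T) z = r, then move theta along -D^T z.
        theta -= step * (D.T @ np.linalg.solve(D @ D.T, r))
    r = model(W, theta) - y
    costs.append(0.5 * float(r @ r))         # cost after the final step
    return theta, np.array(costs)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 10))   # fixed random weights of the toy model
y = rng.standard_normal(3)         # target output
theta_final, costs = adapted_descent(W, y, np.zeros(10))
print(costs[0], costs[-1])
```

Under this assumed flow the output evolves by $\dot{x} = D \dot{\theta} = -(x - y)$, so in continuous time the cost $\tfrac{1}{2}|x - y|^2$ decays like $e^{-2s}$ independently of $\theta$, which is one way to read "uniform exponential convergence rate". The Euler discretization above reproduces this only approximately.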