We consider the gradient descent flow widely used to minimize the $\mathcal{L}^2$ cost function in deep learning networks, and introduce two modified versions: one adapted to the overparametrized setting, and the other to the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized setting and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry.
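To illustrate how a uniform exponential rate yields an a priori stopping time, the following is a minimal numerical sketch under assumptions that are not taken from the text above. It assumes the cost is normalized as $\mathcal{C} = \tfrac12 |x(\theta) - y|^2$ and that the adapted flow is $\dot\theta = -D^T (D D^T)^{-1} (x(\theta) - y)$, where $D$ is the Jacobian of the output map and the rank condition is read as $D$ having full row rank. Under these assumptions the output residual satisfies $\dot r = -r$, so $\mathcal{C}(t) = \mathcal{C}(0)\, e^{-2t}$ and the cost reaches $\epsilon$ at $t = \tfrac12 \ln(\mathcal{C}(0)/\epsilon)$. The toy model, the matrix `B`, and all function names are hypothetical.

```python
import math
import numpy as np

# Hypothetical toy model (not from the paper): K = 6 parameters, n = 2 outputs,
# so the problem is overparametrized (K > n).
B = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]])


def model(theta):
    """Output map x(theta) = B (theta + 0.3 sin(theta))."""
    return B @ (theta + 0.3 * np.sin(theta))


def jacobian(theta):
    """D = B diag(1 + 0.3 cos(theta)); the diagonal lies in [0.7, 1.3],
    so D has full row rank for every theta (the assumed rank condition)."""
    return B * (1.0 + 0.3 * np.cos(theta))


def cost(theta, y):
    """Assumed normalization: C = 1/2 |x(theta) - y|^2."""
    r = model(theta) - y
    return 0.5 * float(r @ r)


def adapted_direction(theta, y):
    """Assumed adapted flow: theta' = -D^T (D D^T)^{-1} r, so that r' = -r."""
    D = jacobian(theta)
    r = model(theta) - y
    return -D.T @ np.linalg.solve(D @ D.T, r)


def stopping_time(c0, eps):
    """Time at which C(0) exp(-2 t) reaches eps (zero if already below eps)."""
    return max(0.0, 0.5 * math.log(c0 / eps))


def integrate(theta0, y, T, h=1e-3):
    """Explicit Euler discretization of the assumed adapted flow up to time T."""
    theta = theta0.copy()
    for _ in range(math.ceil(T / h)):
        theta = theta + h * adapted_direction(theta, y)
    return theta


if __name__ == "__main__":
    theta0 = np.zeros(6)
    y = np.array([2.0, -1.0])
    c0 = cost(theta0, y)
    T = stopping_time(c0, 1e-4)
    print("predicted stopping time:", T)
    print("cost at stopping time:", cost(integrate(theta0, y, T), y))
```

The stopping time here is computed from the initial cost alone, before any integration is run. The rate constant 2 is specific to the assumed normalization of the cost, and the explicit Euler step only approximates the continuous flow.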