We consider supervised learning in Deep Learning (DL) networks and exploit the freedom in choosing the Riemannian metric with respect to which the gradient descent flow is defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric. Here, we instead define the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in parameter space, one adapted to the overparametrized setting and the other to the underparametrized setting. In the overparametrized case, we prove that, provided a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can exist only if a rank loss occurs and that, generically, they are not isolated points but elements of a critical submanifold of parameter space.
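As an illustration of the overparametrized case, the following is a minimal numerical sketch, not the construction used in the paper. It assumes that the modified flow takes the pseudoinverse form $\dot\theta = -D^T (D D^T)^{-1} \nabla_x C$, where $D$ is the Jacobian of the stacked network outputs $x$ with respect to the parameters $\theta$, and that the cost is $C = \frac12 |x - y|^2$. Under these assumptions the outputs follow the Euclidean gradient flow $\dot x = -(x - y)$ whenever $D D^T$ is invertible (the rank condition). The toy network, the finite-difference Jacobian, the explicit Euler discretization, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

HIDDEN = 4  # hidden width of the toy network (illustrative choice)

def network_outputs(theta, inputs):
    """Scalar outputs of a toy one-hidden-layer tanh network, one per training input."""
    d = inputs.shape[1]
    W1 = theta[: HIDDEN * d].reshape(HIDDEN, d)
    b1 = theta[HIDDEN * d : HIDDEN * d + HIDDEN]
    w2 = theta[HIDDEN * d + HIDDEN : HIDDEN * d + 2 * HIDDEN]
    b2 = theta[-1]
    return np.tanh(inputs @ W1.T + b1) @ w2 + b2

def jacobian(theta, inputs, eps=1e-6):
    """Central finite-difference Jacobian D = d(outputs)/d(theta), shape (N, K)."""
    cols = []
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        diff = network_outputs(theta + step, inputs) - network_outputs(theta - step, inputs)
        cols.append(diff / (2 * eps))
    return np.stack(cols, axis=1)

def cost(theta, inputs, targets):
    """L^2 cost C = 0.5 * |x(theta) - y|^2 (normalization assumed here)."""
    r = network_outputs(theta, inputs) - targets
    return 0.5 * float(r @ r)

def modified_flow(theta, inputs, targets, dt=0.05, steps=200):
    """Explicit Euler steps of theta' = -D^T (D D^T)^{-1} (x - y)."""
    costs = [cost(theta, inputs, targets)]
    for _ in range(steps):
        residual = network_outputs(theta, inputs) - targets
        D = jacobian(theta, inputs)
        # Rank condition: D D^T must be invertible (K >= N and D of full rank N).
        theta = theta - dt * D.T @ np.linalg.solve(D @ D.T, residual)
        costs.append(cost(theta, inputs, targets))
    return theta, costs

rng = np.random.default_rng(0)
inputs = rng.normal(size=(3, 2))                 # N = 3 training inputs in R^2
targets = rng.normal(size=3)                     # N = 3 scalar targets
theta0 = rng.normal(size=HIDDEN * 2 + 2 * HIDDEN + 1)  # K = 17 parameters > N
theta, costs = modified_flow(theta0, inputs, targets)
print(costs[0], costs[-1])
```

In the continuous-time limit, and with the normalization assumed above, $x(t) - y = e^{-t}(x(0) - y)$, so $C(t) = e^{-2t} C(0)$ and a tolerance $\epsilon$ is reached at $t = \frac12 \ln(C(0)/\epsilon)$. This is the kind of a priori stopping time the abstract refers to, although the constants in the paper may differ. The Euler discretization reproduces this decay only approximately.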