We consider the scenario of supervised learning in Deep Learning (DL) networks, and exploit the freedom of choice of the Riemannian metric relative to which the gradient descent flow is defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric on that space. Here, instead, we define the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in parameter space, one adapted to the overparametrized setting and the other to the underparametrized setting. In the overparametrized case, we prove that, provided a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of this construction to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can only exist if a rank loss occurs, and that, generically, they are not isolated points but elements of a critical submanifold of parameter space.
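To make the overparametrized construction concrete, write $x(\theta)\in\mathbb{R}^Q$ for the output of the network on the training data, $y$ for the corresponding target, and $D=\partial x/\partial\theta$ for the Jacobian of the output with respect to the parameters (this notation is introduced here for illustration only). One natural realization of a parameter flow that is a gradient flow for the Euclidean metric in the output layer is $\dot\theta = -D^T(DD^T)^{-1}(x-y)$, which is well defined precisely when the rank condition $\mathrm{rank}\,D = Q$ holds; along it, $\dot x = D\dot\theta = -(x-y)$, so the ${\mathcal L}^2$ cost $\tfrac12|x-y|^2$ decays like $e^{-2t}$. The sketch below illustrates this numerically; the explicit formula, the toy two-layer map, the step size, and the use of the Moore-Penrose pseudoinverse are illustrative assumptions made here, not the definitions of the present paper.

\begin{verbatim}
import jax
import jax.numpy as jnp

# Minimal numerical sketch (an illustration under the assumptions stated
# above, not the paper's construction): the parameter update is chosen so
# that the output layer itself performs Euclidean gradient descent on the
# L^2 cost.

Q = 2                                  # output dimension
x0 = jnp.array([1.0, 0.5])             # fixed training input (toy example)
y = jnp.array([1.0, -0.5])             # training target in the output layer

def output(theta):
    # Toy overparametrized two-layer map R^12 -> R^2,
    # standing in for a DL network.
    W1 = theta[:6].reshape(3, 2)
    W2 = theta[6:].reshape(2, 3)
    return W2 @ jnp.tanh(W1 @ x0)

theta = jax.random.normal(jax.random.PRNGKey(0), (12,))
eta = 0.1                              # Euler step size for the flow

for k in range(100):
    x = output(theta)
    D = jax.jacobian(output)(theta)        # Q x P Jacobian d(output)/d(params)
    assert jnp.linalg.matrix_rank(D) == Q  # the rank condition of the abstract
    # theta_dot = -D^T (D D^T)^{-1} (x - y), computed via the Moore-Penrose
    # pseudoinverse; then x_dot = D theta_dot = -(x - y), so the L^2 cost
    # 0.5 * |x - y|^2 decays exponentially along the flow.
    theta = theta - eta * jnp.linalg.pinv(D) @ (x - y)
    if k % 20 == 0:
        print(k, float(0.5 * jnp.sum((x - y) ** 2)))
\end{verbatim}

In this discretization the output error contracts by approximately a factor $(1-\eta)$ per step, the discrete counterpart of the uniform exponential rate and of the a priori stopping time mentioned above.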