In this paper, we present theoretical work explaining why simple gradient descent methods are so successful at solving the non-convex optimization problems that arise in learning large-scale neural networks (NNs). After introducing a mathematical tool called the canonical space, we prove that the objective functions in learning NNs are convex in the canonical model space. We further show that the gradients in the original NN model space and in the canonical space are related by a pointwise linear transformation, which is represented by the so-called disparity matrix. Furthermore, we prove that gradient descent methods surely converge to a global minimum of zero loss provided that the disparity matrices maintain full rank. If this full-rank condition holds, the learning of NNs behaves in the same way as normal convex optimization. Finally, we show that singular disparity matrices are extremely unlikely to occur in large NNs. In particular, when over-parameterized NNs are randomly initialized, gradient descent algorithms converge in probability to a global minimum of zero loss.
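The core argument of the abstract can be sketched in a few lines. The notation here is an assumption made for illustration and is not taken from the paper: $w$ denotes the NN parameters, $f$ the corresponding point in the canonical space, $L$ the loss, and $H(w)$ the disparity matrix, assumed to act on the canonical-space gradient and to have a trivial null space when it is full rank.

```latex
% Illustrative sketch only; symbols w, f, L, H(w) are assumed notation.
\begin{align*}
  % Pointwise linear relation between the two gradients:
  \nabla_{w} L &= H(w)\, \nabla_{f} L . \\
  % If H(w) is full rank (trivial null space), a stationary point in w
  % is also a stationary point in the canonical space:
  \nabla_{w} L = 0 \ \text{and}\ \ker H(w) = \{0\}
    &\;\Longrightarrow\; \nabla_{f} L = 0 . \\
  % L is convex in f, so a vanishing gradient identifies a global minimum:
  \nabla_{f} L = 0
    &\;\Longrightarrow\; L(f) = \min_{f'} L(f') .
\end{align*}
```

Under these assumptions, any point where gradient descent in $w$ stalls is a global minimum as long as the disparity matrix stays full rank there, which is the sense in which learning behaves like convex optimization.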