We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by a positive parameter. We focus on the case where the node scalings are non-identical, in contrast to the classical Neural Tangent Kernel (NTK) parameterisation, in which every node receives the same scaling. We prove that, for large neural networks, gradient flow converges with high probability to a global minimum and can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets that illustrate our theoretical results and show the benefit of such scaling for pruning and transfer learning.
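To make the setting concrete, the following is a minimal sketch of the forward pass of a shallow network with per-node output scaling. The abstract does not give the functional form, so everything here is an assumption for illustration: each hidden node is multiplied by the square root of its scaling, the activation is ReLU, the output weights are fixed random signs, and the non-identical scalings are a hypothetical mix of a uniform term and a normalised power-law decay (the names `gamma` and `alpha` are ours). The only standard ingredient is the NTK case, where every node gets the same scaling `1/m`, i.e. the output is multiplied by `1/sqrt(m)`.

```python
import math
import random


def ntk_scalings(m):
    """Identical scaling 1/m for every hidden node (classical NTK parameterisation)."""
    return [1.0 / m] * m


def asymmetric_scalings(m, gamma=0.5, alpha=2.0):
    """Hypothetical non-identical scalings that sum to 1.

    Mixes a uniform term (weight gamma) with a normalised power-law decay
    (weight 1 - gamma). Both gamma and alpha are illustrative assumptions.
    """
    raw = [(j + 1) ** (-alpha) for j in range(m)]
    total = sum(raw)
    return [gamma / m + (1.0 - gamma) * r / total for r in raw]


def init_params(m, d, seed=0):
    """Gaussian hidden weights and random-sign output weights (assumed initialisation)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return weights, signs


def forward(x, weights, signs, scalings):
    """f(x) = sum_j sqrt(lambda_j) * a_j * relu(<w_j, x> / sqrt(d))."""
    d = len(x)
    out = 0.0
    for w, a, lam in zip(weights, signs, scalings):
        pre = sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d)
        out += math.sqrt(lam) * a * max(pre, 0.0)
    return out


if __name__ == "__main__":
    m, d = 1000, 5
    weights, signs = init_params(m, d)
    x = [1.0, -0.5, 0.25, 0.0, 2.0]
    print("NTK scaling:       ", forward(x, weights, signs, ntk_scalings(m)))
    print("asymmetric scaling:", forward(x, weights, signs, asymmetric_scalings(m)))
```

With identical scalings every node contributes at the same order, whereas in the asymmetric sketch a few nodes carry much larger scalings than the rest; the sketch covers only the parameterisation, not the gradient-flow training dynamics or the convergence result.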