Training neural networks with first-order optimisation methods is at the core of the empirical success of deep learning. The scale of initialisation is a crucial factor, as small initialisations are generally associated with a feature learning regime, in which gradient descent is implicitly biased towards simple solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al. (2018). For small initialisation and one-hidden-layer ReLU networks, the early stage of the training dynamics leads to an alignment of the neurons towards key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence. This sparsity-inducing alignment, however, comes at the cost of making the training objective harder to minimise: we also provide a simple data example for which overparameterised networks fail to converge towards global minima and instead converge to a spurious stationary point.