Our theoretical understanding of neural networks lags behind their empirical success. One important unexplained phenomenon is why and how, during training with gradient descent, the theoretical capacity of a neural network is reduced to an effective capacity that fits the task. Here we investigate the mechanism by which gradient descent achieves this, by analyzing the learning dynamics at the level of individual neurons in single-hidden-layer ReLU networks. We identify three dynamical principles (mutual alignment, unlocking, and racing) that together explain why capacity can often be reduced successfully after training by merging equivalent neurons or pruning low-norm weights. In particular, we explain the mechanism behind the lottery ticket conjecture, that is, why the specific, beneficial initial conditions of some neurons lead them to attain higher weight norms.
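To make the two capacity-reduction operations named above concrete, the following is a minimal sketch and not the authors' procedure. It assumes a bias-free single-hidden-layer ReLU network f(x) = sum_i a_i * relu(w_i . x), treats two neurons as equivalent when their input weight vectors point in the same direction, and uses |a_i| * ||w_i|| as the neuron norm for pruning. The function names, the tolerance, and the threshold are hypothetical choices made for illustration.

```python
import math


def relu(z):
    return max(0.0, z)


def forward(neurons, x):
    """Evaluate f(x) = sum_i a_i * relu(w_i . x).

    neurons: list of (w, a) pairs, where w is a list of input weights
    and a is the scalar output weight (no biases, by assumption).
    """
    return sum(a * relu(sum(wi * xi for wi, xi in zip(w, x))) for w, a in neurons)


def merge_aligned(neurons, tol=1e-9):
    """Merge neurons whose input weight vectors share a direction.

    Relies on relu(c * z) = c * relu(z) for c > 0, so each neuron can be
    rewritten with a unit-norm w and output weight a * ||w||, and neurons
    with the same unit direction can have their output weights summed.
    """
    merged = []  # entries are [unit_direction, combined_output_weight]
    for w, a in neurons:
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            continue  # a neuron with zero input weights always outputs 0
        u = [wi / norm for wi in w]
        for entry in merged:
            if all(abs(ui - vi) <= tol for ui, vi in zip(u, entry[0])):
                entry[1] += a * norm
                break
        else:
            merged.append([u, a * norm])
    return [(u, a) for u, a in merged]


def prune_low_norm(neurons, threshold):
    """Drop neurons whose |a| * ||w|| falls below threshold (assumed criterion)."""
    kept = []
    for w, a in neurons:
        norm = math.sqrt(sum(wi * wi for wi in w))
        if abs(a) * norm >= threshold:
            kept.append((w, a))
    return kept


# Toy network: two mutually aligned neurons and one low-norm neuron.
neurons = [([1.0, 0.0], 2.0), ([2.0, 0.0], 0.5), ([0.0, 1.0], 1e-4)]
merged = merge_aligned(neurons)
pruned = prune_low_norm(merged, threshold=1e-3)
```

Under these assumptions merging is exact: the merged network computes the same function as the original. Pruning is only approximate, and the error it introduces at an input x is bounded by the removed neurons' norms times ||x||.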