We introduce a general theoretical framework, designed for the study of gradient optimisation of deep neural networks, that encompasses ubiquitous architecture choices including batch normalisation, weight normalisation and skip connections. Our framework determines the curvature and regularity properties of multilayer loss landscapes in terms of their constituent layers, thereby elucidating the roles played by normalisation layers and skip connections in globalising these properties. We then demonstrate the utility of this framework in two respects. First, we give the only proof of which we are aware that a class of deep neural networks can be trained using gradient descent to global optima even when such optima only exist at infinity, as is the case for the cross-entropy cost. Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
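To make the remark about optima at infinity concrete, the following minimal sketch runs gradient descent on the binary cross-entropy of a single correctly labelled example with one scalar weight. This toy setup, along with the names `cross_entropy` and `gradient_descent`, is an illustrative assumption and is not the class of networks treated in the paper. The loss is strictly positive for every finite weight and tends to zero only as the weight tends to infinity, so the iterates grow without bound while the loss keeps decreasing.

```python
import math


def cross_entropy(margin: float) -> float:
    """Binary cross-entropy log(1 + exp(-margin)) for a correctly labelled example."""
    return math.log1p(math.exp(-margin))


def gradient_descent(steps: int, lr: float = 1.0, w: float = 0.0) -> tuple[float, float]:
    """Gradient descent on a scalar weight w for the toy example x = 1, y = +1.

    Hypothetical setup: the logit margin equals w, so the loss is cross_entropy(w).
    """
    for _ in range(steps):
        # d/dw log(1 + exp(-w)) = -1 / (1 + exp(w)); never zero for finite w
        grad = -1.0 / (1.0 + math.exp(w))
        w -= lr * grad
    return w, cross_entropy(w)


w_short, loss_short = gradient_descent(100)
w_long, loss_long = gradient_descent(10_000)
print(f"100 steps:    w = {w_short:.3f}, loss = {loss_short:.6f}")
print(f"10000 steps:  w = {w_long:.3f}, loss = {loss_long:.6f}")
```

Running more steps yields a larger weight and a smaller, but still positive, loss: no finite minimiser exists, which is the situation the convergence result in the abstract addresses.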