Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
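The two initializations contrasted above can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes square layers, zero biases, a weight variance of $1/\text{width}$ for the Gaussian case, and Haar-distributed orthogonal matrices sampled by QR decomposition. The function names (`haar_orthogonal`, `gaussian_weights`, `final_preactivation_norms`) and the statistic measured (the normalized squared norm of the last-layer preactivation) are hypothetical choices made for illustration only; they are not the correlators analyzed in the paper.

```python
import numpy as np


def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix (Haar measure) via QR decomposition."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Rescale each column of q by the sign of the matching diagonal entry of r;
    # this removes the sign ambiguity of QR so the sample is Haar-distributed.
    return q * np.sign(np.diag(r))


def gaussian_weights(n, rng):
    """Sample an n x n matrix with i.i.d. N(0, 1/n) entries (assumed scaling)."""
    return rng.standard_normal((n, n)) / np.sqrt(n)


def final_preactivation_norms(sampler, width, depth, n_nets, seed=0):
    """Return mean(z**2) of the last-layer preactivation for n_nets random nets.

    Each network is a bias-free tanh MLP with `depth` square layers of size
    `width`, evaluated on a fixed all-ones input.
    """
    rng = np.random.default_rng(seed)
    x = np.ones(width)  # fixed input with mean(x**2) == 1
    out = np.empty(n_nets)
    for k in range(n_nets):
        z = sampler(width, rng) @ x  # first-layer preactivation
        for _ in range(depth - 1):
            z = sampler(width, rng) @ np.tanh(z)
        out[k] = np.mean(z**2)
    return out


if __name__ == "__main__":
    # Compare the spread of the statistic across networks at several depths.
    for depth in (5, 20, 40):
        g = final_preactivation_norms(gaussian_weights, 64, depth, 200)
        o = final_preactivation_norms(haar_orthogonal, 64, depth, 200)
        print(depth, g.std() / g.mean(), o.std() / o.mean())
```

Running the script prints, for each depth, the relative spread of the statistic across randomly initialized networks under each scheme. This is only a rough empirical probe of the depth dependence the abstract describes, and the width, depths, and sample count are arbitrary.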