Overparameterized fully-connected neural networks have been shown to behave like kernel models when trained with gradient descent, under standard scaling conditions on the width, the learning rate, and the parameter initialization. In the limit of infinite width and infinitesimal learning rate, the resulting kernel describes the learned model's output through a closed-form solution that depends on the architecture and the activation function. The Neural Tangent Kernel, central to this description, remains constant throughout training, a phenomenon referred to as ``lazy training'' or the ``lazy regime''. Prior works show that the ``lazy regime'' leads to non-varying hidden neuron activations in infinitely-wide networks. Moreover, as infinitely-wide networks increase in depth, the Neural Tangent Kernel induces a closed-form solution that is data-independent and hence trivial. The Neural Tangent Kernel thus seemingly fails to describe the complexity of overparameterized neural networks along two distinct axes: large width and large depth. In this work, we challenge these two conclusions and open the door to re-evaluating the Neural Tangent Kernel's role in describing the output of overparameterized neural networks. Specifically, we show experimentally that while deviations in the activations of individual hidden neurons vanish, the aggregate norm of these deviations does not. We support this finding with a theoretical result showing that the activations of the last hidden layer do not remain constant. Furthermore, we demonstrate that properly scaling the depth and stopping time in infinitely-wide ReLU networks yields a well-behaved, non-trivial output at large dataset sizes. We empirically evaluate the stability of this behavior on large datasets, and we describe the essential properties that allow our results to generalize to other kernels.